feature request: depthwise separable convolution #1708

Closed
hyqneuron opened this Issue Jun 3, 2017 · 46 comments

Comments

@hyqneuron

hyqneuron commented Jun 3, 2017

I don't see an implementation for depthwise separable convolution. Currently it is possible with Conv2d by setting groups=out_channels, but this is painfully slow; see the benchmark at the bottom. We need an efficient implementation for this.

I realize torch7's SpatialDepthWiseConvolution is even slower. However, TF seems to have a somewhat optimized implementation, so its depthwise conv is about 3x-8x faster than the normal conv (comparing a 3x3 conv with a 3x3 depthwise conv ONLY, without the pointwise conv), but still slow.

Benchmark comparing time (in seconds) for different group sizes (groups=1, 2, 4, 256). For a small number of groups we get a reasonable speedup; for a large number of groups it instead gets much slower.

Sequential (
  (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (5): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (6): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (7): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (8): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (9): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
testing model of weight size: torch.Size([256, 256, 3, 3])
3.65793013573
Sequential (
  (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2)
  (1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2)
  (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2)
  (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2)
  (4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2)
  (5): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2)
  (6): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2)
  (7): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2)
  (8): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2)
  (9): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2)
)
testing model of weight size: torch.Size([256, 128, 3, 3])
1.99519991875
Sequential (
  (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=4)
  (1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=4)
  (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=4)
  (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=4)
  (4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=4)
  (5): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=4)
  (6): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=4)
  (7): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=4)
  (8): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=4)
  (9): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=4)
)
testing model of weight size: torch.Size([256, 64, 3, 3])
1.34896993637
Sequential (
  (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256)
  (1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256)
  (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256)
  (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256)
  (4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256)
  (5): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256)
  (6): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256)
  (7): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256)
  (8): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256)
  (9): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256)
)
testing model of weight size: torch.Size([256, 1, 3, 3])
5.64783811569
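For reference, a minimal sketch of the kind of timing harness that would produce the numbers above (the input shape, batch size, and iteration count are assumptions, not taken from the original script; only the channel count is fixed by the printed models):

import time
import torch
import torch.nn as nn

x = torch.randn(16, 256, 32, 32, device='cuda')
with torch.no_grad():
    for groups in (1, 2, 4, 256):
        model = nn.Sequential(*[nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=groups)
                                for _ in range(10)]).cuda()
        print(model)
        print('testing model of weight size:', model[0].weight.size())
        model(x)                  # warm-up (algorithm selection, lazy init)
        torch.cuda.synchronize()  # flush queued kernels before timing
        start = time.time()
        for _ in range(20):
            model(x)
        torch.cuda.synchronize()
        print(time.time() - start)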

@jekbradbury

Contributor

jekbradbury commented Jun 12, 2017

As this underlies https://arxiv.org/abs/1706.03059, the demand for depthwise separable convs will only continue to grow. I believe this is largely a case of cuDNN not providing an optimized implementation; perhaps a new THCUNN kernel is in order -- maybe first for 1D, which should be less complex?

Chainer has a fairly simple implementation here, which we could perhaps port until there's an optimized kernel.

@fmassa

Member

fmassa commented Jun 12, 2017

For the record, there is an implementation of SpatialDepthWiseConvolution in THNN/THCUNN, but it mostly performs a for loop over the number of groups, so it shouldn't be any more efficient than our current implementation using groups=nInputPlane. But we could maybe modify it to use bmm instead of mm, as in the Chainer example?
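To illustrate the bmm idea, here is a minimal sketch (an illustration, not the THNN/THCUNN code): im2col each channel with unfold, then run one batched matmul over all N*C channel slices instead of a Python-level loop over groups.

import torch
import torch.nn.functional as F

def depthwise_conv2d_bmm(x, weight, padding=1):
    # x: (N, C, H, W); weight: (C, 1, kH, kW), one filter per channel
    N, C, H, W = x.shape
    kH, kW = weight.shape[-2:]
    cols = F.unfold(x, (kH, kW), padding=padding)        # (N, C*kH*kW, L)
    L = cols.shape[-1]
    cols = cols.view(N * C, kH * kW, L)                  # one im2col block per channel
    w = weight.view(1, C, 1, kH * kW).expand(N, C, 1, kH * kW)
    out = torch.bmm(w.reshape(N * C, 1, kH * kW), cols)  # (N*C, 1, L)
    oH, oW = H + 2 * padding - kH + 1, W + 2 * padding - kW + 1
    return out.view(N, C, oH, oW)

# sanity check against the grouped-conv path
x = torch.randn(2, 8, 16, 16)
w = torch.randn(8, 1, 3, 3)
assert torch.allclose(depthwise_conv2d_bmm(x, w), F.conv2d(x, w, padding=1, groups=8), atol=1e-4)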

@wetliu wetliu referenced this issue Jun 21, 2017

Closed

Benchmark on GPU #24

@soumith

Member

soumith commented Jul 21, 2017

Making this high priority; it looks like demand for this is off the charts.

@szagoruyko

Contributor

szagoruyko commented Aug 2, 2017

Ported the Caffe depthwise conv2d here: https://github.com/szagoruyko/pyinn (needs CuPy)

@futurely

futurely commented Aug 15, 2017

cudnn 7 grouped convolutions and cuda 9 hgemm fix
3522d3a

@futurely futurely referenced this issue Aug 15, 2017

Merged

Cuda9 updates #2263

@soumith soumith added this to High Priority in Issue Status Aug 23, 2017

@soumith soumith moved this from High Priority to High Priority_ in Issue Status Aug 23, 2017

@amdcat

amdcat commented Aug 29, 2017

Anything new on this problem?
Depthwise convolution is still slow in my PyTorch environment.

@WendyShang

WendyShang commented Aug 29, 2017

Hi, based on this paper (https://arxiv.org/pdf/1608.04337.pdf; I think it is one of the earliest discoveries of this kind of separable convolution, but it was somehow ignored), we could benefit from the flexibility of separating channels into non-overlapping subgroups and performing a separate convolution on each subgroup, rather than necessarily treating every channel separately and convolving over each individual depth only. There are similar workarounds to achieve such a design, as with depthwise convolution, though they are also slow.

This adds more flexibility in architecture design, and it would be great if PyTorch could consider a more flexible version than the original Torch depthwise separable convolution :)

@qianguih

qianguih commented Aug 30, 2017

Hi,

Just wanted to check the status of this thread. I tested the latest version of PyTorch, and it looks like there is no update on separable convolution yet. Please correct me if I'm wrong. : )

@soumith soumith added this to nn / autograd / torch in Issue Categories Sep 11, 2017

@rkaplan

rkaplan commented Oct 2, 2017

Looking forward to this feature being implemented!

@killeent killeent referenced this issue Oct 10, 2017

Merged

Spatial Depthwise Convolution on the GPU #3057

6 of 6 tasks complete
@soumith

Member

soumith commented Oct 18, 2017

this is now added to master via #3057

@soumith soumith closed this Oct 18, 2017

@qianguih

qianguih commented Oct 22, 2017

Hi,
Thanks for your amazing work on this! I just updated my PyTorch for the faster depthwise convolution. However, I couldn't find documentation on how to call the new depthwise conv function. Is there an example for this?

@colesbury

Member

colesbury commented Oct 23, 2017

@qianguih use groups=in_channels=out_channels e.g.:

m = nn.Conv2d(128, 128, kernel_size=3, groups=128).cuda()

@qianguih

qianguih commented Oct 23, 2017

@colesbury I see. Thank you!

@elliothe

elliothe commented Oct 23, 2017

Since PyTorch and Torch share the same backend THCUNN files, could I bring this SpatialDepthWiseConvolution function into Torch? The original function in Torch has a GPU memory leak problem.

@killeent

Contributor

killeent commented Oct 23, 2017

@qianguih note that we do support having a depthwise multiplier, so groups=in_channels must be true, but out_channels can be any multiple of in_channels, e.g.:

m = nn.Conv2d(128, 256, kernel_size=3, groups=128).cuda()

is also valid.
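(For intuition, the weight shape makes the multiplier visible: with groups=128, each of the 128 input channels gets out_channels/groups = 2 single-channel filters.)

import torch.nn as nn
m = nn.Conv2d(128, 256, kernel_size=3, groups=128)
print(m.weight.shape)  # torch.Size([256, 1, 3, 3]): two 1-channel 3x3 filters per input channel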

@killeent

Contributor

killeent commented Oct 23, 2017

@elliothe one potential issue you will likely run into is that the existing SpatialDepthWiseConvolution in LuaTorch (whose backing implementations are actually removed in this PR) uses a differing format in the Lua layers; see e.g. my abandoned PR here: torch/nn#1277. So you would have to handle some of these differences.

@elliothe

elliothe commented Oct 23, 2017

@killeent Thanks for your answer! What I did was to rebuild torch.cunn against the replaced THCUNN library. I then receive the following errors when I try to call the SpatialDepthWiseConvolution function:

not found: THNN_CudaSpatialDepthWiseConvolution_updateOutput/home/elliot/torch/install/share/lua/5.1/nn/THNN.lua:108: failed to find function/global THNN_CudaSpatialDepthWiseConvolution_updateOutput	
not found: THNN_CudaSpatialDepthWiseConvolution_updateGradInput/home/elliot/torch/install/share/lua/5.1/nn/THNN.lua:108: failed to find function/global THNN_CudaSpatialDepthWiseConvolution_updateGradInput	
not found: THNN_CudaSpatialDepthWiseConvolution_accGradParameters/home/elliot/torch/install/share/lua/5.1/nn/THNN.lua:108: failed to find function/global THNN_CudaSpatialDepthWiseConvolution_accGradParameters	
not found: THNN_CudaDoubleSpatialDepthWiseConvolution_updateOutput/home/elliot/torch/install/share/lua/5.1/nn/THNN.lua:108: failed to find function/global THNN_CudaDoubleSpatialDepthWiseConvolution_updateOutput	
not found: THNN_CudaDoubleSpatialDepthWiseConvolution_updateGradInput/home/elliot/torch/install/share/lua/5.1/nn/THNN.lua:108: failed to find function/global THNN_CudaDoubleSpatialDepthWiseConvolution_updateGradInput	
not found: THNN_CudaDoubleSpatialDepthWiseConvolution_accGradParameters/home/elliot/torch/install/share/lua/5.1/nn/THNN.lua:108: failed to find function/global THNN_CudaDoubleSpatialDepthWiseConvolution_accGradParameters	
not found: THNN_CudaHalfSpatialDepthWiseConvolution_updateOutput/home/elliot/torch/install/share/lua/5.1/nn/THNN.lua:108: failed to find function/global THNN_CudaHalfSpatialDepthWiseConvolution_updateOutput	
not found: THNN_CudaHalfSpatialDepthWiseConvolution_updateGradInput/home/elliot/torch/install/share/lua/5.1/nn/THNN.lua:108: failed to find function/global THNN_CudaHalfSpatialDepthWiseConvolution_updateGradInput	
not found: THNN_CudaHalfSpatialDepthWiseConvolution_accGradParameters/home/elliot/torch/install/share/lua/5.1/nn/THNN.lua:108: failed to find function/global THNN_CudaHalfSpatialDepthWiseConvolution_accGradParameters	

I am wondering whether the formatting problem you mentioned leads to this error.

@KeCh96

KeCh96 commented Feb 2, 2018

I have upgraded my PyTorch to 0.3.0, but I found that m = nn.Conv2d(128, 256, kernel_size=3, groups=128) is still 2x slower than m = nn.Conv2d(128, 256, kernel_size=3). I am really confused by this problem; do I need to upgrade PyTorch to another version? @killeent @qianguih

@KeCh96

KeCh96 commented Feb 3, 2018

I am using CUDA 8.0. Do I need CUDA 9?

@fmassa

Member

fmassa commented Feb 3, 2018

@KeCh96 I believe the number of input channels should be equal to the number of output channels for the new codepath to be activated

@ngimel

Contributor

ngimel commented Feb 3, 2018

The new codepath should be activated for these parameters. The reason the non-grouped convolution is still faster is likely that a very efficient Winograd kernel from cuDNN is used for it. Depthwise-separable convolutions are bandwidth-bound, so they will never reach the performance of compute-bound ones; also, the kernels are very simple (they don't even use shared memory), so they can be optimized further. I also suspect that performance with differing numbers of input and output channels is worse than with matching ones, which is another avenue for optimization.
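To make "bandwidth-bound" concrete, here is a back-of-the-envelope arithmetic-intensity comparison (a rough sketch: it counts multiply-accumulates against idealized float32 DRAM traffic, ignores caches, and the layer size is an arbitrary example):

C, H, W, k = 256, 56, 56, 3
dense_macs = C * C * H * W * k * k                         # every output channel reads every input channel
dw_macs = C * H * W * k * k                                # depthwise: one filter per channel
dense_bytes = 4 * (C * H * W + C * H * W + C * C * k * k)  # read input, write output, read weights
dw_bytes = 4 * (C * H * W + C * H * W + C * k * k)
print(dense_macs / dense_bytes)                            # ~211 MACs per byte: compute-bound
print(dw_macs / dw_bytes)                                  # ~1.1 MACs per byte: memory-bound

With roughly one multiply-accumulate per byte moved, the depthwise kernel's runtime is set by memory bandwidth rather than FLOPs, which is why it cannot match the throughput of the dense Winograd path.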

@fmassa

Member

fmassa commented Feb 3, 2018

@ngimel yes, you are right. I had just looked at the comment in the code and not at the actual code itself. We should probably update that comment.

@KeCh96

KeCh96 commented Feb 4, 2018

I have tested m = nn.Conv2d(64, 64, kernel_size=3, groups=64) and m = nn.Conv2d(64, 64, kernel_size=3, groups=1) on both GPU and CPU. In my network, input_channel*K = output_channel, where K is 1 or 2, and I set batchsize=1. The results are below:

GPU:

group=1 examples_per_sec: 83
group=4 examples_per_sec: 67
group=32 examples_per_sec: 91

CPU:

group=1 examples_per_sec: 80
group=4 examples_per_sec: 81
group=32 examples_per_sec: 17

Depthwise convolution has not led to any obvious speedup, and is even slower on CPU.

Note: my PyTorch is 0.3.0 and Anaconda is 3-5.0.1, both the latest versions. My CPU is an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, my GPU is a Tesla K80, and my CUDA is 8.0.

@fmassa @ngimel

@KeCh96

KeCh96 commented Feb 4, 2018

I also tested the Conv3d operation on both GPU and CPU. Again, input_channel*K = output_channel, where K is 1 or 2, and I set batchsize=8. The results are below:

GPU:

group=1 examples_per_sec: 170
group=4 examples_per_sec: 154
group=in_channel examples_per_sec: 22 (where in_channel can be 64,128,256,512)

CPU:

group=1 examples_per_sec: 3
group=4 examples_per_sec: 3
group=in_channel examples_per_sec: 3 (where in_channel can be 64,128,256,512)

Note: this experiment was done on another computer, whose CPU is an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz and whose GPU is a TITAN Xp.

I have checked these experiments and am greatly confused. Has anyone run into the same problem?

@KeCh96

KeCh96 commented Feb 11, 2018

Does "bandwidth-bound" means transfering data between CPU and GPU? If so, we should not run into "bandwidth-bound" when we only use CPU. But in my experiment above, I found the speed also drops when I use CPU. @ngimel

@tstandley

tstandley commented Mar 2, 2018

I'm not sure I buy the "bandwidth bound" explanation either. If you increase the kernel size from 3x3 to, say, 5x5, the operation needs the same amount of non-cache memory but takes far longer.

I don't know how to program CUDA, but here's how I see the operation working.
Let's say we run a Conv2d(1024, 1024, kernel_size=3, groups=1024) on a 20x20x1024 tensor:

Read a single channel from a single layer into cache. That's 400 floats (1600 bytes); it fits into cache.
Read that channel's worth of parameters into cache (just 9 floats).

Then we do the following 18x18 times:
Multiply those 9 floats (weights) by the appropriate 9 floats of the channel in cache.
Store the output in the appropriate cell of an 18x18 matrix in main memory.

Here we have one main-memory read and one main-memory write per channel per spatial location.

If we do the same operation with Conv2d(1024, 1024, kernel_size=5, groups=1024), we should still be doing only one read and one write per spatial location per channel from main memory.

The amount of time this takes should be insignificant next to the pointwise convolution that follows the channel-wise convolution:
Conv2d(1024, 1024, kernel_size=1, groups=1)

Here we need to read a million floats from main memory and do 400 1024x1024 matrix multiplies.

The channel-wise operation shouldn't even register next to that.

I really wish we could get this working. We could have blazingly fast convolutions with arbitrarily large kernels.
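Putting rough numbers on this argument (a sketch: idealized multiply-accumulate and float32 DRAM-traffic counts only, no padding, no caching effects; the layer sizes are the ones in the comment above):

C, H, W, k = 1024, 20, 20, 3
out_hw = (H - k + 1) ** 2                            # 18*18 output locations
dw_macs = C * out_hw * k * k                         # channel-wise 3x3: ~3.0M MACs
pw_macs = C * C * H * W                              # the 1x1 pointwise that follows: ~419.4M MACs
dw_bytes = 4 * (C * H * W + C * out_hw + C * k * k)  # ~3.0 MB moved
pw_bytes = 4 * (2 * C * H * W + C * C)               # ~7.5 MB moved
print(pw_macs / dw_macs)                             # ~140x more compute in the pointwise step

On these idealized counts the channel-wise pass is two orders of magnitude cheaper in compute and moves less than half the data, which is consistent with the argument above.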

@SeungjunNah

SeungjunNah commented Mar 27, 2018

Hi,

Currently, PyTorch uses the THNN implementation of depthwise convolution, thnn_conv_depthwise2d, instead of cuDNN.

According to the recent cuDNN 7.1.1 release notes, it seems cuDNN has implemented grouped convolution for groupCount > 1 for all forward & backward algorithms.
Also, the current cuDNN API reference says the Winograd algorithm supports a group count greater than 0.

Are there any plans to switch the grouped convolution backend from THNN to cuDNN?

@ibmua

ibmua commented Apr 8, 2018

I'm also very interested in this. An NVIDIA guy emailed me a year ago saying they were going to implement the grouping feature, yet it's still nowhere to be found. This is one of the biggest things in DL of the last several years, yet it hasn't come to fruition. I hope you are going to include it soon, and thanks @SeungjunNah for noticing it in the release notes. Great to know it's at least fully ready at the cuDNN level by now. Can't wait to see my GPUs chugging through conv nets at 4x speed.

@colinfang

colinfang commented Apr 24, 2018

To benefit from cuDNN 7.1, is it as simple as removing the if (params.is_depthwise(input, weight)) branch in https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Convolution.cpp#L337, so that it falls back to cuDNN's native implementations? And turning on cudnn.benchmark = True so that the Winograd algorithm has a chance to kick in? Sadly, in my case it is still at least 2x slower than TensorFlow's own tailored CUDA version.
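(For reference, the flag mentioned here is the standard autotuner switch:)

import torch
torch.backends.cudnn.benchmark = True  # benchmark algorithms per input shape and cache the fastest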

@soumith

Member

soumith commented Apr 24, 2018

@colinfang FWIW, cuDNN hasn't optimized for depthwise convolution specifically, only for grouped convolution (which is the general case of depthwise).

@ngimel

Contributor

ngimel commented Apr 24, 2018

cuDNN has some kernels for depthwise-separable convolutions, but on average they are no better than PyTorch's. Feel free to bring TensorFlow's tailored implementation over to PyTorch.

@colinfang

colinfang commented Apr 24, 2018

In my case the input is 9x9, the kernel is 5x5, padding=4, and input channels == output channels == groups == 50000, with batchsize = 1. I tried both cuDNN's native conv2d and conv_transpose2d. The transposed version is slower in forward but faster in backward; overall they have similar performance. And thnn_conv_depthwise2d is similar to the fastest cuDNN 7.1 convolution version. Perhaps I didn't configure it correctly.

@fmassa

Member

fmassa commented Apr 24, 2018

I haven't checked the TF implementation, but I believe one reason it might be faster in this case is that the number of channels is fairly large, and TF uses the NHWC layout by default?

@ngimel

Contributor

ngimel commented Apr 24, 2018

Also, I don't know how heavily templated TF's implementation is. In PyTorch, 5x5 templates are not instantiated and fall back to the generic case, which might be about 2x slower than if they were instantiated.

@colinfang

colinfang commented Apr 24, 2018

I think TF also only instantiates 3x3 kernel templates, but it has a special code path for images up to 32x32; I'm not sure if that is relevant. (NHWC vs. NCHW doesn't make much difference in TF.)
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc#L41 The fastest forward algorithm in cuDNN 7.1 is conv2d_grouped_direct_kernel (spotted via nvprof); I'm not sure what that is. (It might be CUDNN_CONVOLUTION_FWD_ALGO_DIRECT, but the SDK says that is not implemented in cuDNN.)

@ngimel

Contributor

ngimel commented Apr 24, 2018

It does not look like it's too hard to bring it over to pytorch, so if you want pytorch to have awesome performance on small inputs, that might be your best bet :-)

@assassint2017

assassint2017 commented Jul 14, 2018

Hello everyone, I am really confused now. Can someone summarize in one sentence how to use high-performance depthwise convolution? Which PyTorch version? Which CUDA or cuDNN version?

@ezyang

Contributor

ezyang commented Jul 14, 2018

PyTorch 0.4 with cuDNN should be sufficient.
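A quick way to verify what your environment actually has (standard PyTorch calls):

import torch
print(torch.__version__)                  # expect >= 0.4
print(torch.version.cuda)                 # CUDA toolkit version PyTorch was built against
print(torch.backends.cudnn.version())     # e.g. 7102 for cuDNN 7.1.2
print(torch.backends.cudnn.is_available())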

@assassint2017

assassint2017 commented Jul 15, 2018

cuDNN7?

@austingg

austingg commented Jul 16, 2018

The latest cuDNN 7 patch supports the depthwise conv path.

@Kongsea

Contributor

Kongsea commented Aug 6, 2018

I have a question.
For example, if my input is (64, 128, 7, 7), then using depthwise convolution it will be nn.Conv2d(128, 256, 3, groups=128).
However, in this situation there will be 256 different kernels. What if I want these 256 kernels to all be one and the same kernel (i.e., sharing kernel weights)? How do I set the parameters in this situation?
Thank you.

@wandering007

Contributor

wandering007 commented Aug 6, 2018

@Kongsea

m = nn.Conv2d(128, 256, 3, groups=128)
m.weight.data = m.weight.data[0].expand(256, *m.weight.shape[1:])

The expanded kernel weights share the same memory.
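A follow-up sketch (my own suggestion, not from this thread; the module name is made up): if the shared kernel should also be trained as a single parameter, one option is to keep a single (1, 1, 3, 3) parameter and expand it inside forward, so that gradients from all 256 output channels accumulate into the one shared weight:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKernelDepthwise(nn.Module):  # hypothetical helper, for illustration
    def __init__(self, in_channels=128, out_channels=256, k=3):
        super().__init__()
        self.in_channels, self.out_channels = in_channels, out_channels
        self.kernel = nn.Parameter(torch.randn(1, 1, k, k))  # the single shared kernel
    def forward(self, x):
        w = self.kernel.expand(self.out_channels, 1, -1, -1)  # a view, no copy; grads sum back
        return F.conv2d(x, w, groups=self.in_channels)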

@myih

myih commented Aug 14, 2018

@austingg Hi, I can't find anything about depthwise convolution support in NVIDIA's documentation or release notes.

@Kongsea

Contributor

Kongsea commented Aug 14, 2018

@wandering007 Thank you.

@austingg

austingg commented Aug 27, 2018

@myih From the cuDNN v7 release notes: "Performance improvements for grouped convolutions when input channels and output channels per group are 1, 2, or 4 for the following algorithms."
