overparametrized convolution error in tf.nn.separable_conv2d #4330

Closed
Jongchan opened this issue Sep 12, 2016 · 15 comments

Jongchan commented Sep 12, 2016

I'm trying to build a network of layers with channel-wise separable convolutions (just like the Factorized CNN paper).

I've found that separable_conv2d and depthwise_conv2d are the two options, and I'm trying out both.

What I am trying to build is shown below:

# depthwise filter: [filter_h, filter_w, in_channels, channel_multiplier] = [3, 3, 64, 3]
depthwise_filter = tf.get_variable("depth_conv_w", [3, 3, 64, 3], initializer=tf.random_normal_initializer(stddev=np.sqrt(2.0/9/32)))
# pointwise filter: [1, 1, in_channels * channel_multiplier, out_channels] = [1, 1, 192, 64]
pointwise_filter = tf.get_variable("point_conv_w", [1, 1, 192, 64], initializer=tf.random_normal_initializer(stddev=np.sqrt(2.0/9/128)))
conv_tensor = tf.nn.depthwise_conv2d(tensor, depthwise_filter, [1, 1, 1, 1], padding='SAME')
conv_tensor = tf.nn.conv2d(conv_tensor, pointwise_filter, [1, 1, 1, 1], padding='VALID')

and it works fine.

However, if I replace the last two lines with

conv_tensor = tf.nn.separable_conv2d(tensor, depthwise_filter, pointwise_filter, [1, 1, 1, 1], padding='SAME')

then TensorFlow gives me an 'overparametrized convolution' error, as specified here.

In practice, channel_multiplier * in_channel is usually larger than or equal to out_channel, so I believe raising an error in this case is a mistake.

PS) I can get the same result as separable_conv2d by using depthwise_conv2d followed by a 1x1 convolution. Is there any advantage to using separable_conv2d?
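
To illustrate the equivalence mentioned in the PS, here is a minimal sketch (assuming a TensorFlow 1.x environment; the shapes are hypothetical and chosen so that in_channels * channel_multiplier <= out_channels, i.e. the current check does not fire):

import numpy as np
import tensorflow as tf

# Hypothetical shapes: 8 input channels, channel multiplier 2, 32 output channels.
x = tf.constant(np.random.rand(1, 16, 16, 8), dtype=tf.float32)
dw = tf.constant(np.random.rand(3, 3, 8, 2), dtype=tf.float32)    # depthwise filter
pw = tf.constant(np.random.rand(1, 1, 16, 32), dtype=tf.float32)  # pointwise filter

# Two-op version: depthwise conv followed by a 1x1 ("pointwise") conv.
y1 = tf.nn.depthwise_conv2d(x, dw, [1, 1, 1, 1], padding='SAME')
y1 = tf.nn.conv2d(y1, pw, [1, 1, 1, 1], padding='VALID')
# Fused version.
y2 = tf.nn.separable_conv2d(x, dw, pw, [1, 1, 1, 1], padding='SAME')

with tf.Session() as sess:
    a, b = sess.run([y1, y2])
    print(np.allclose(a, b, atol=1e-5))  # expected: True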

@andydavis1 (Contributor) commented:

From a previous discussion of this topic:

"If this inequality isn't satisfied, you're expanding the number of activations and then reducing it, which is usually not a good idea. In particular, it often means you're using more parameters and FLOPS than you would with a regular convolution, which defeats the purpose of using a separable conv."

andydavis1 added the stat:awaiting response (Status - Awaiting response from author) label Sep 12, 2016
@Jongchan (Author) commented:

I thought that the purpose of using separable conv is to share the intermediate activations before the linear channel projection (in a normal convolution, they are not shared), which is far more efficient.
A diagram from 'Factorized Convolutional Neural Networks' explains this concept well.

Maybe I am misunderstanding the whole idea of separable_conv2d. Do you have a link to the previous discussion? @andydavis1

@andydavis1 (Contributor) commented:

Unfortunately, I don't have a link to the discussion. My understanding is that separable convolutions are used to reduce flop count (exploiting redundancy across the channels), which is why we raise errors for those parameter combinations.

You may get more information by posting this on Stack Overflow with the tensorflow tag, so I'm going to close this out for now.

@mikowals (Contributor) commented:

Are you sure the math behind this check isn't simply wrong? In a standard convolution the number of parameters is height x width x in_channels x out_channels. Because a separable convolution uses a filter with height and width of one for the projection to out_channels, it saves parameters by a factor of nearly height x width.

So a proper check for over-parameterization is closer to in_channels x channel_multiplier > out_channels x height x width.
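
For concreteness, a tiny sketch (hypothetical variable names) of the current check versus the check proposed above, evaluated on the shapes from the original post:

h, w = 3, 3                      # filter height and width
in_ch, mult, out_ch = 64, 3, 64  # in_channels, channel_multiplier, out_channels

current_check = in_ch * mult > out_ch           # 192 > 64  -> True, so an error is raised
proposed_check = in_ch * mult > out_ch * h * w  # 192 > 576 -> False, so no error
print(current_check, proposed_check)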

Jongchan commented Sep 14, 2016

Yes, I understand that it can reduce the flop count significantly, and that when channel_multiplier is large the operation becomes very expensive.
However, I think this case should not be raised as an error, for two reasons:

  1. separable_conv2d is not equivalent to a conventional convolution.
  2. The flop count may still be lower even when in_channel * channel_multiplier > out_channel. I will elaborate below.

I will assume in_channel = out_channel, since many modules (layers) in deep architectures have this configuration.

For a single patch, the number of multiplications in separable_conv2d is in_channel * filter_w * filter_h * channel_multiplier + 1 * 1 * (in_channel * channel_multiplier) * out_channel.
The number of multiplications in a conventional conv is in_channel * filter_w * filter_h * out_channel.
Comparing the two, since out_channel >> channel_multiplier in most cases, separable_conv2d may require fewer computations.

Is it really necessary to add such a restriction that in_channel * channel_multiplier <= out_channel?
(In my case, for a single patch with channel_multiplier = 3, the number of multiplications is 64 * 3 * 3 * 3 + 1 * 1 * 192 * 64 = 14,016, whereas the conventional conv yields 64 * 3 * 3 * 64 = 36,864.)
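
A quick sketch of this per-patch comparison (hypothetical variable names, using the shapes from the original post):

fh, fw = 3, 3                    # filter height and width
in_ch, mult, out_ch = 64, 3, 64  # in_channels, channel_multiplier, out_channels

# Multiplications per output patch.
separable_mults = in_ch * fh * fw * mult + (in_ch * mult) * out_ch  # 1,728 + 12,288 = 14,016
conventional_mults = in_ch * fh * fw * out_ch                       # 36,864
print(separable_mults, conventional_mults)  # separable wins despite in_ch * mult > out_ch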

vrv commented Sep 14, 2016

cc @vincentvanhoucke in case he wants to comment.

@andydavis1 (Contributor) commented:

Yes. I worked out a back-of-the-envelope flop-cost comparison between separable and conventional convolutions:

OR = out_rows
OC = out_cols
ID = in_depth
DM = depth_multiplier
OD = out_depth
FR = filter_rows
FC = filter_cols

separable_conv_cost = OR * OC * ID * DM * (FR * FC + OD)
conventional_conv_cost = OR * OC * ID * FR * FC * OD

// So to save on flops we want:

DM < (FR * FC * OD) / (FR * FC + OD)

// So plugging in your numbers from above (FR=FC=3, OD=192):

DM < ~8.6

So to save flops we want DM below roughly 8.6, and your DM = 3. So perhaps (at least from a flops perspective) this check is too restrictive. There may be other reasons for the restriction (perhaps expansion and then reduction of parameters leads to training issues), but I'll let Vincent comment on that...
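
A small sketch of this cost comparison (hypothetical helper names; it simply evaluates the two formulas above and the resulting break-even depth multiplier):

def separable_conv_cost(OR, OC, ID, DM, FR, FC, OD):
    # depthwise part plus the 1x1 pointwise part
    return OR * OC * ID * DM * (FR * FC + OD)

def conventional_conv_cost(OR, OC, ID, FR, FC, OD):
    return OR * OC * ID * FR * FC * OD

def break_even_dm(FR, FC, OD):
    # separable is cheaper than conventional while DM is below this value
    return (FR * FC * OD) / (FR * FC + OD)

print(break_even_dm(3, 3, 192))  # ~8.6, so DM = 3 saves flops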

@andydavis1 andydavis1 reopened this Sep 14, 2016
@vincentvanhoucke (Contributor) commented:

If you have valid uses for a separable convolution which blows up the number of activations and then shrinks them back, then I'm OK accepting a PR that removes this check. Separable convolutions predate this paper by a couple of years, so it's entirely possible that people have found productive uses of this regime. In general, introducing an activation bottleneck anywhere in the network tends to be a terrible idea from an optimization standpoint, and unless your dimensions are wildly pathological, it also implies that you're introducing more free parameters than there were in the convolution in the first place. But one can argue there is nothing broken about doing so.

Jongchan commented Sep 21, 2016

Thank you for your comments; I'm sorry I am a week late with my response.
I understand your concerns regarding computational optimization. Productive uses of this kind of operation still need to be verified, but researchers/developers should be able to choose it freely. Since the computation/free parameters may blow up, a warning could be added to the documentation instead.
I will open a PR soon, but I am very new to GitHub and need to look up how to do it...

@singlasahil14 commented:

@vincentvanhoucke: Suppose I am doing style transfer on a phone, which calls for depthwise separable convolutions in place of regular convolutions. The model involves a decoder that typically reduces the number of channels from 128 to 64, then 64 to 32, then 32 to 3. In all three cases the number of channels decreases. Do you believe that is a valid use case? I wanted to use separable convolutions for style transfer, but cannot because of this check.
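
For reference, a tiny sketch (hypothetical shapes) of why each of these decoder layers trips the check even with a depth multiplier of 1:

decoder_layers = [(128, 64), (64, 32), (32, 3)]  # (in_channels, out_channels)
mult = 1                                         # depth multiplier

for in_ch, out_ch in decoder_layers:
    # The check errors out whenever in_channels * multiplier > out_channels.
    print(in_ch, out_ch, in_ch * mult > out_ch)  # True for all three layers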

@vincentvanhoucke (Contributor) commented:

Like I said, I'd gladly and swiftly accept a PR that removes that check.

vrv commented May 24, 2017

I'll send a change for it.

@singlasahil14 commented:

Thank you! Using a depthwise convolution followed by a pointwise convolution is very slow; it takes almost double the time.

maciekcc pushed a commit that referenced this issue May 25, 2017
Fixes #4330.

RELNOTES: Allow uses of over-parameterized separable convolution.
PiperOrigin-RevId: 157035904

ghost commented Nov 9, 2017

@singlasahil14 Were you able to use separable conv in your model? I also want to use separable conv, but in Keras. Or does it only work in TensorFlow? Can you share your code?

@singlasahil14 commented:

@arnoldaclf: I wasn't trying to use it in Keras.
https://github.com/singlasahil14/style-transfer/blob/master/style_network_factory.py
Here is my code.
