overparametrized convolution error in tf.nn.separable_conv2d #4330

Closed
Jongchan opened this issue Sep 12, 2016 · 15 comments

Jongchan commented Sep 12, 2016

I'm trying to build a network of layers with channel-wise separable convolutions (just like the Factorized CNN paper).

I've found that separable_conv2d and depthwise_conv2d are the two options, and I'm trying out both.

What I am trying to build is shown below:

# depthwise filter: [filter_h, filter_w, in_channels, channel_multiplier] = [3, 3, 64, 3]
depthwise_filter = tf.get_variable("depth_conv_w", [3, 3, 64, 3], initializer=tf.random_normal_initializer(stddev=np.sqrt(2.0/9/32)))
# pointwise filter: [1, 1, in_channels * channel_multiplier, out_channels] = [1, 1, 192, 64]
pointwise_filter = tf.get_variable("point_conv_w", [1, 1, 192, 64], initializer=tf.random_normal_initializer(stddev=np.sqrt(2.0/9/128)))
conv_tensor = tf.nn.depthwise_conv2d(tensor, depthwise_filter, [1, 1, 1, 1], padding='SAME')
conv_tensor = tf.nn.conv2d(conv_tensor, pointwise_filter, [1, 1, 1, 1], padding='VALID')

and it works fine.

However, if I replace the last two lines with

conv_tensor = tf.nn.separable_conv2d(tensor, depthwise_filter, pointwise_filter, [1, 1, 1, 1], padding='SAME')

then TensorFlow gives me an 'overparametrized convolution' error, as specified here.

In practice, channel_multiplier * in_channel is usually larger than or equal to out_channel, so I believe raising an error in this case is a mistake.

PS) I can get the same result as separable_conv2d by using depthwise_conv2d followed by a 1x1 convolution. Is there any advantage to using separable_conv2d?
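
To illustrate the equivalence mentioned in the PS, here is a minimal sketch (assuming a TensorFlow 1.x environment; the shapes are hypothetical and chosen so that in_channels * channel_multiplier <= out_channels, i.e. the current check does not fire):

import numpy as np
import tensorflow as tf

# Hypothetical shapes: 8 input channels, channel multiplier 2, 32 output channels.
x = tf.constant(np.random.rand(1, 16, 16, 8), dtype=tf.float32)
dw = tf.constant(np.random.rand(3, 3, 8, 2), dtype=tf.float32)    # depthwise filter
pw = tf.constant(np.random.rand(1, 1, 16, 32), dtype=tf.float32)  # pointwise filter

# Two-op version: depthwise conv followed by a 1x1 ("pointwise") conv.
y1 = tf.nn.depthwise_conv2d(x, dw, [1, 1, 1, 1], padding='SAME')
y1 = tf.nn.conv2d(y1, pw, [1, 1, 1, 1], padding='VALID')
# Fused version.
y2 = tf.nn.separable_conv2d(x, dw, pw, [1, 1, 1, 1], padding='SAME')

with tf.Session() as sess:
    a, b = sess.run([y1, y2])
    print(np.allclose(a, b, atol=1e-5))  # expected: True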

@andydavis1 (Contributor) commented:

From a previous discussion of this topic:

"If this inequality isn't satisfied, you're expanding the number of activations and then reducing it, which is usually not a good idea. In particular, it often means you're using more parameters and FLOPS than you would with a regular convolution, which defeats the purpose of using a separable conv."

andydavis1 added the stat:awaiting response (Status - Awaiting response from author) label Sep 12, 2016
@Jongchan (Author) commented:

I thought that the purpose of using separable conv is to share the intermediate activations before the linear channel projection (in a normal convolution, they are not shared), which is far more efficient.
A diagram from 'Factorized Convolutional Neural Networks' explains this concept well.

Maybe I am misunderstanding the whole idea of separable_conv2d. Do you have a link to the previous discussion? @andydavis1

@andydavis1 (Contributor) commented:

Unfortunately, I don't have a link to the discussion. My understanding is that separable convolutions are used to reduce flop count (exploiting redundancy across the channels), which is why we raise errors for those parameter combinations.

You may get more information by posting this on Stack Overflow with the tensorflow tag, so I'm going to close this out for now.

@mikowals (Contributor) commented:

Are you sure the math behind this check isn't simply wrong? In a standard convolution the number of parameters is height x width x in_channels x out_channels. Because a separable convolution uses a filter with height and width of one for the projection to out_channels, it saves parameters by a factor of nearly height x width.

So a proper check for over-parameterization is closer to in_channels x channel_multiplier > out_channels x height x width.
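
For concreteness, a tiny sketch (hypothetical variable names) of the current check versus the check proposed above, evaluated on the shapes from the original post:

h, w = 3, 3                      # filter height and width
in_ch, mult, out_ch = 64, 3, 64  # in_channels, channel_multiplier, out_channels

current_check = in_ch * mult > out_ch           # 192 > 64  -> True, so an error is raised
proposed_check = in_ch * mult > out_ch * h * w  # 192 > 576 -> False, so no error
print(current_check, proposed_check)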

Jongchan commented Sep 14, 2016

Yes, I understand that it can reduce the flop count significantly, and that when channel_multiplier is large the operation becomes very expensive.
However, I think this case should not be raised as an error, for two reasons:

  1. separable_conv2d is not equivalent to a conventional convolution.
  2. The flop count may still be lower even when in_channel * channel_multiplier > out_channel. I will elaborate below.

I will assume in_channel = out_channel, since many modules (layers) in deep architectures have this configuration.

For a single patch, the number of multiplications in separable_conv2d is in_channel * filter_w * filter_h * channel_multiplier + 1 * 1 * (in_channel * channel_multiplier) * out_channel.
The number of multiplications in a conventional conv is in_channel * filter_w * filter_h * out_channel.
Comparing the two, since out_channel >> channel_multiplier in most cases, separable_conv2d may require fewer computations.

Is it really necessary to add such a restriction that in_channel * channel_multiplier <= out_channel?
(In my case, for a single patch with channel_multiplier = 3, the number of multiplications is 64 * 3 * 3 * 3 + 1 * 1 * 192 * 64 = 14,016, whereas the conventional conv yields 64 * 3 * 3 * 64 = 36,864.)
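
A quick sketch of this per-patch comparison (hypothetical variable names, using the shapes from the original post):

fh, fw = 3, 3                    # filter height and width
in_ch, mult, out_ch = 64, 3, 64  # in_channels, channel_multiplier, out_channels

# Multiplications per output patch.
separable_mults = in_ch * fh * fw * mult + (in_ch * mult) * out_ch  # 1,728 + 12,288 = 14,016
conventional_mults = in_ch * fh * fw * out_ch                       # 36,864
print(separable_mults, conventional_mults)  # separable wins despite in_ch * mult > out_ch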

vrv commented Sep 14, 2016

cc @vincentvanhoucke in case he wants to comment.

@andydavis1 (Contributor) commented:

Yes. I worked out a back-of-the-envelope flop-cost comparison between separable and conventional convolutions:

OR = out_rows
OC = out_cols
ID = in_depth
DM = depth_multiplier
OD = out_depth
FR = filter_rows
FC = filter_cols

separable_conv_cost = OR * OC * ID * DM * (FR * FC + OD)
conventional_conv_cost = OR * OC * ID * FR * FC * OD

// So to save on flops we want:

DM < (FR * FC * OD) / (FR * FC + OD)

// So plugging in your numbers from above (FR=FC=3, OD=192):

DM < ~8.6

So to save flops we want DM below roughly 8.6, and your DM = 3. So perhaps (at least from a flops perspective) this check is too restrictive. There may be other reasons for the restriction (perhaps expansion and then reduction of parameters leads to training issues), but I'll let Vincent comment on that...
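
A small sketch of this cost comparison (hypothetical helper names; it simply evaluates the two formulas above and the resulting break-even depth multiplier):

def separable_conv_cost(OR, OC, ID, DM, FR, FC, OD):
    # depthwise part plus the 1x1 pointwise part
    return OR * OC * ID * DM * (FR * FC + OD)

def conventional_conv_cost(OR, OC, ID, FR, FC, OD):
    return OR * OC * ID * FR * FC * OD

def break_even_dm(FR, FC, OD):
    # separable is cheaper than conventional while DM is below this value
    return (FR * FC * OD) / (FR * FC + OD)

print(break_even_dm(3, 3, 192))  # ~8.6, so DM = 3 saves flops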

@andydavis1 andydavis1 reopened this Sep 14, 2016
@vincentvanhoucke (Contributor) commented:

If you have valid uses for a separable convolution which blows up the number of activations and then shrinks them back, then I'm OK accepting a PR that removes this check. Separable convolutions predate this paper by a couple of years, so it's entirely possible that people have found productive uses of this regime. In general, introducing an activation bottleneck anywhere in the network tends to be a terrible idea from an optimization standpoint, and unless your dimensions are wildly pathological, it also implies that you're introducing more free parameters than there were in the convolution in the first place. But one can argue there is nothing broken about doing so.

Jongchan commented Sep 21, 2016

Thank you for your comments; I'm sorry I am a week late with my response.
I understand your concerns regarding computational optimization. Productive uses of this kind of operation still need to be verified, but researchers/developers should be able to choose it freely. Since the computation/free parameters may blow up, a warning could be added to the documentation instead.
I will open a PR soon, but I am very new to GitHub and need to look up how to do it...

@singlasahil14 commented:

@vincentvanhoucke: Suppose I am doing style transfer on a phone, which calls for depthwise separable convolutions in place of regular convolutions. The model involves a decoder that typically reduces the number of channels from 128 to 64, then 64 to 32, then 32 to 3. In all three cases the number of channels decreases. Do you believe that is a valid use case? I wanted to use separable convolutions for style transfer, but cannot because of this check.
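
For reference, a tiny sketch (hypothetical shapes) of why each of these decoder layers trips the check even with a depth multiplier of 1:

decoder_layers = [(128, 64), (64, 32), (32, 3)]  # (in_channels, out_channels)
mult = 1                                         # depth multiplier

for in_ch, out_ch in decoder_layers:
    # The check errors out whenever in_channels * multiplier > out_channels.
    print(in_ch, out_ch, in_ch * mult > out_ch)  # True for all three layers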

@vincentvanhoucke (Contributor) commented:

Like I said, I'd gladly and swiftly accept a PR that removes that check.

vrv commented May 24, 2017

I'll send a change for it.

@singlasahil14 commented:

Thank you! Using a depthwise convolution followed by a pointwise convolution is very slow; it takes almost double the time.

maciekcc pushed a commit that referenced this issue May 25, 2017
Fixes #4330.

RELNOTES: Allow uses of over-parameterized separable convolution.
PiperOrigin-RevId: 157035904

ghost commented Nov 9, 2017

@singlasahil14 Were you able to use separable conv in your model? I also want to use separable conv, but in Keras. Or does it only work in TensorFlow? Can you share your code?

@singlasahil14 commented:

@arnoldaclf: I wasn't trying to use it in Keras.
https://github.com/singlasahil14/style-transfer/blob/master/style_network_factory.py
Here is my code.
