# **From Paper To Keras: MobileNets With TensorFlow ( Notebook is under maintenance! )**

> *This notebook is under maintenance and may not give the desired outputs.*

[<img src="https://github.com/shubham0204/Privacy_Policy_Texts/blob/master/notebook_button_two.png?raw=true" width="170" height="50" align="center">](https://medium.com/@equipintelligence/exploring-mobilenets-from-paper-to-keras-f01308ada818)
[<img src="https://github.com/shubham0204/Privacy_Policy_Texts/raw/master/read_the_paper_button.png" width="180" height="50" align="center">](https://arxiv.org/abs/1704.04861)

---

MobileNets are CNNs designed for mobile and embedded devices. What sets them apart from models like VGG, DenseNet or Inception is that they have far fewer trainable parameters, which gives lower latency on devices with limited computational power ( such as mobile phones ). They are built from depthwise separable convolutions, which have fewer trainable parameters than regular convolutions.

The images below are taken from [A Basic Introduction to Separable Convolutions](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728) by [Chi-Feng Wang](https://towardsdatascience.com/@reina.wang), which is a must-read for understanding separable convolutions.

---

As you may observe in the [paper](https://arxiv.org/abs/1704.04861),

For a <u>standard convolution</u> ( like `tf.keras.layers.Conv2D` ), the output feature map is computed as,

$\Large \mathbf{G}_{k, l, n}=\sum_{i, j, m} \mathbf{K}_{i, j, m, n} \cdot \mathbf{F}_{k+i-1, l+j-1, m}$

Where $G$ is the output feature map, $K$ and $F$ are the kernel and the input feature map respectively. 

We create a kernel of size $D_K \times D_K \times M \times N$ to transform a $D_F \times D_F \times M$ feature map into a $D_F \times D_F \times N$ output feature map ( assuming stride 1 and padding ). $D_K$ is the spatial dimension of the square kernel, $D_F$ is the spatial dimension of the square feature map, and $M$ and $N$ are the number of input and output channels.

[<img src="https://miro.medium.com/max/1400/1*XloAmCh5bwE4j1G7yk5THw.png" width="500" height="200" align="center">]()


To understand the image above, we take a $12 \times 12 \times 3$ image and a $5 \times 5 \times 3$ kernel to produce an $8 \times 8 \times 1$ output ( stride 1, no padding ). Likewise, we use 256 such kernels to produce the final $8 \times 8 \times 256$ output feature map.
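
As a rough illustration of the formula and the example above, here is a minimal NumPy sketch of a standard convolution with stride 1 and no padding. The function name `standard_conv` and the random inputs are made up for this example, and the loop-based implementation is meant for reading, not for speed.

```python
import numpy as np

def standard_conv(F, K):
    """Naive standard convolution (stride 1, no padding).

    F: input feature map, shape (H, W, M)
    K: kernel, shape (D_K, D_K, M, N), where D_K is the kernel width
    Returns G with shape (H - D_K + 1, W - D_K + 1, N).
    """
    d_k = K.shape[0]
    n = K.shape[3]
    out_h = F.shape[0] - d_k + 1
    out_w = F.shape[1] - d_k + 1
    G = np.zeros((out_h, out_w, n))
    for k in range(out_h):
        for l in range(out_w):
            patch = F[k:k + d_k, l:l + d_k, :]  # shape (D_K, D_K, M)
            # Sum over i, j and m for all N output channels at once.
            G[k, l, :] = np.tensordot(patch, K, axes=([0, 1, 2], [0, 1, 2]))
    return G

rng = np.random.default_rng(0)
image = rng.standard_normal((12, 12, 3))
kernels = rng.standard_normal((5, 5, 3, 256))

out = standard_conv(image, kernels)
print(out.shape)  # (8, 8, 256)

# Cost of this layer: number of weights and number of multiplications.
standard_params = kernels.size
standard_mults = out.shape[0] * out.shape[1] * kernels.size
print(standard_params, standard_mults)  # 19200 1228800
```

The kernel holds 19,200 weights and the layer needs about 1.2 million multiplications, which is the baseline that separable convolutions try to reduce.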

---

For a <u>separable convolution</u>, the depthwise step applies one filter per input channel, and its output feature map is computed as,

$\Large \hat{\mathbf{G}}_{k, l, m}=\sum_{i, j} \hat{\mathbf{K}}_{i, j, m} \cdot \mathbf{F}_{k+i-1, l+j-1, m}$

[<img src="https://miro.medium.com/max/1400/1*Q7a20gyuunpJzXGnWayUDQ.png" width="500" height="200" align="center">]()


In separable convolutions, considering the above example, we take in a $12 \times 12 \times 3$ image and 3 kernels of size $5 \times 5 \times 1$. Each kernel runs over one of the 3 channels of the image, and together they produce an output feature map of size $8 \times 8 \times 3$. Then, using $1 \times 1$ convolutions, called pointwise convolutions ( `tf.keras.layers.Conv2D` with `kernel_size=1` ), we change the depth: each $1 \times 1 \times 3$ kernel produces an $8 \times 8 \times 1$ map, and 256 such kernels take the output from $8 \times 8 \times 3$ to an $8 \times 8 \times 256$ feature map.
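
The two steps can be sketched in NumPy as follows. This is only an illustration under simple assumptions ( stride 1, no padding, random weights ). The helper names `depthwise_conv` and `pointwise_conv` are hypothetical. In this sketch the depthwise step keeps one output channel per input channel ( $8 \times 8 \times 3$ ), and 256 pointwise kernels of size $1 \times 1 \times 3$ then produce the $8 \times 8 \times 256$ output.

```python
import numpy as np

def depthwise_conv(F, K_dw):
    """Naive depthwise convolution (stride 1, no padding): one filter per channel.

    F: (H, W, M), K_dw: (D_K, D_K, M). Returns (H - D_K + 1, W - D_K + 1, M).
    """
    d_k = K_dw.shape[0]
    out_h = F.shape[0] - d_k + 1
    out_w = F.shape[1] - d_k + 1
    G = np.zeros((out_h, out_w, F.shape[2]))
    for k in range(out_h):
        for l in range(out_w):
            patch = F[k:k + d_k, l:l + d_k, :]
            # Sum over i and j only, so the channels stay separate.
            G[k, l, :] = np.sum(patch * K_dw, axis=(0, 1))
    return G

def pointwise_conv(G, K_pw):
    """1x1 convolution: mixes the M channels of every pixel. K_pw: (M, N)."""
    return G @ K_pw

rng = np.random.default_rng(0)
image = rng.standard_normal((12, 12, 3))
k_dw = rng.standard_normal((5, 5, 3))   # three 5x5x1 depthwise kernels
k_pw = rng.standard_normal((3, 256))    # 256 pointwise kernels of size 1x1x3

dw_out = depthwise_conv(image, k_dw)
pw_out = pointwise_conv(dw_out, k_pw)
print(dw_out.shape, pw_out.shape)  # (8, 8, 3) (8, 8, 256)

# Compare the cost with a standard 5x5 convolution that has 256 output channels.
separable_params = k_dw.size + k_pw.size
separable_mults = 8 * 8 * k_dw.size + 8 * 8 * k_pw.size
standard_mults = 8 * 8 * 5 * 5 * 3 * 256
ratio = separable_mults / standard_mults
print(separable_params, separable_mults)  # 843 53952
```

Here `ratio` equals $\frac{1}{N} + \frac{1}{D_K^2} = \frac{1}{256} + \frac{1}{25}$, the reduction factor derived in the paper, so for this example the separable version needs roughly 4% of the multiplications.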

**We'll need a GPU hardware accelerator for training the model. Change the runtime type to GPU by going to `Runtime > Change runtime type > Hardware accelerator > GPU`.**




# 1) Loading the Data from TensorFlow Datasets

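We download our training/testing data using [TensorFlow Datasets](https://www.tensorflow.org/datasets). The code below loads the `beans` dataset, which contains $500 \times 500$ RGB images of bean leaves in three classes. The [RockPaperScissors](http://www.laurencemoroney.com/rock-paper-scissors-dataset/) dataset by [Laurence Moroney](http://www.laurencemoroney.com/), which contains images of the rock, paper and scissors hand gestures, also has three classes; using it instead would mean changing the dataset name and the model's input shape.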

* We'll first convert the `tf.data.Dataset` to a NumPy array.
* Then we'll normalize the images and one-hot encode the corresponding labels.
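
To make the two preprocessing steps concrete, here is a small NumPy-only sketch on made-up data. The tiny `images` and `labels` arrays are invented for this example, and `np.eye` is used here as a stand-in for what `tf.keras.utils.to_categorical` produces.

```python
import numpy as np

# Two fake 1x1 RGB "images" with uint8 pixel values, and their class labels.
images = np.array([[[[0, 128, 255]]], [[[255, 0, 64]]]], dtype=np.uint8)
labels = np.array([0, 2])
num_classes = 3

# Normalize: scale the pixel values from [0, 255] to [0, 1].
normalized = images.astype(np.float32) / 255.0

# One-hot encode: row i of the identity matrix is the encoding of class i.
one_hot = np.eye(num_classes, dtype=np.float32)[labels]

print(normalized.min(), normalized.max())  # 0.0 1.0
print(one_hot.tolist())  # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```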




In [None]:

import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np

train_dataset = tfds.load( name="beans" , split="train" )
test_dataset = tfds.load( name="beans" , split="test" )

train_X = [ x[ 'image' ] for x in tfds.as_numpy( train_dataset ) ]
train_Y = [ x[ 'label' ] for x in tfds.as_numpy( train_dataset ) ]
test_X = [ x[ 'image' ] for x in tfds.as_numpy( test_dataset ) ]
test_Y = [ x[ 'label' ] for x in tfds.as_numpy( test_dataset ) ]

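# Scale pixel values to [ 0 , 1 ]. float32 uses half the memory of the default float64.
train_X = np.array( train_X , dtype=np.float32 ) / 255
train_Y = np.array( train_Y )
test_X = np.array( test_X , dtype=np.float32 ) / 255
test_Y = np.array( test_Y )

# One-hot encode the integer labels.
train_Y = tf.keras.utils.to_categorical( train_Y , num_classes=3 )
test_Y = tf.keras.utils.to_categorical( test_Y , num_classes=3 )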



# 2) The MobileNet model

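As per the [paper](https://arxiv.org/abs/1704.04861), MobileNet has 28 layers when depthwise and pointwise convolutions are counted as separate layers. After an initial standard convolution, the network alternates depthwise convolutions with standard $1 \times 1$ ( pointwise ) convolutions. All convolutional layers, standard as well as depthwise, are followed by batch normalization and ReLU activation layers.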

The last layers are the average pooling, fully connected and softmax layers. The softmax layer produces the final class probabilities. The model architecture is as follows,


![alt text](https://github.com/shubham0204/Privacy_Policy_Texts/blob/master/Capture.PNG?raw=true)

$s2,s1$ denote strides of 2 and 1 respectively. $Conv \ dw$ is the depthwise convolution, which we implement below as `SeparableConv`. $Conv$ is the standard convolution, defined below as `Conv`.
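
To see what the strides do to the feature map size, the following sketch traces the spatial size through the strided layers. It assumes `padding='same'`, where a stride of $s$ maps a size of $n$ to $\lceil n / s \rceil$. The stride list is read off this notebook's model code, and the helper names are made up for this example.

```python
import math

def same_padding_size(size, stride):
    """Spatial size after a convolution with padding='same' (assumed here)."""
    return math.ceil(size / stride)

def trace_sizes(input_size, strides):
    """Return the feature map size after each layer in `strides`."""
    sizes = []
    size = input_size
    for s in strides:
        size = same_padding_size(size, s)
        sizes.append(size)
    return sizes

# Strides of the 3x3 layers in this notebook's model code, in order.
# The 1x1 Conv layers in between use stride 1 and do not change the size.
model_strides = [2, 1, 2, 1, 2, 1, 2, 2]

print(trace_sizes(224, model_strides))  # [112, 112, 56, 56, 28, 28, 14, 7]
print(trace_sizes(500, model_strides))  # [250, 250, 125, 125, 63, 63, 32, 16]
```

Under these assumptions a 224-pixel input reaches the $7 \times 7$ average pooling layer as a $7 \times 7$ map, while a 500-pixel input arrives as $16 \times 16$, so a $7 \times 7$ pool no longer covers the whole map.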

First, there's a **Width Multiplier** denoted as $\alpha$ where $\alpha \in ( 0 , 1 ]$. With it, every layer receives $\alpha M$ input feature maps and produces $\alpha N$ output feature maps.

Second, there's a **Resolution Multiplier** denoted as $\rho$ where $\rho \in ( 0 , 1 ]$. With it, every layer receives square input feature maps of size $\rho D_F$.
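
The effect of both multipliers on the cost of one depthwise separable layer can be estimated with the paper's cost expression, $D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$. The sketch below is a plain-Python estimate, not part of the model code. The function name `separable_cost` is hypothetical, and rounding the scaled sizes to the nearest integer is an assumption of this example.

```python
def separable_cost(d_k, m, n, d_f, alpha=1.0, rho=1.0):
    """Multiplications of one depthwise separable layer (stride 1).

    d_k: kernel width, m / n: input / output channels, d_f: feature map width.
    alpha and rho are the width and resolution multipliers.
    """
    m_scaled = round(alpha * m)
    n_scaled = round(alpha * n)
    d_scaled = round(rho * d_f)
    depthwise = d_k * d_k * m_scaled * d_scaled * d_scaled
    pointwise = m_scaled * n_scaled * d_scaled * d_scaled
    return depthwise + pointwise

# A 3x3 separable layer with 512 input and output channels on a 14x14 map.
full = separable_cost(3, 512, 512, 14)
thinner = separable_cost(3, 512, 512, 14, alpha=0.75)
thinner_and_smaller = separable_cost(3, 512, 512, 14, alpha=0.75, rho=0.714)

print(full)                 # 52283392
print(thinner)              # 29578752
print(thinner_and_smaller)  # 15091200
```

Because the pointwise term dominates, the cost shrinks roughly with $\alpha^2$ and $\rho^2$.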

In [None]:

# Note: You may use tf.keras.layers.SeparableConv2D, but it fuses the depthwise and pointwise steps,
# so you won't be able to add BatchNorm and ReLU layers between them.
# Hence, we first perform a depthwise convolution and then a Conv2D with a kernel size of 1.
def SeparableConv( x , num_filters , strides , alpha=1.0 ):
    # Depthwise step: one 3x3 filter per input channel. The stride is applied here, as in the paper.
    x = tf.keras.layers.DepthwiseConv2D( kernel_size=3 , strides=strides , padding='same' )( x )
    x = tf.keras.layers.BatchNormalization(momentum=0.9997)( x )
    x = tf.keras.layers.Activation( 'relu' )( x )
    # Pointwise step: a 1x1 convolution that mixes the channels. Conv2D expects an integer filter count.
    x = tf.keras.layers.Conv2D( int( num_filters * alpha ) , kernel_size=( 1 , 1 ) , use_bias=False , padding='same' )( x )
    x = tf.keras.layers.BatchNormalization(momentum=0.9997)(x)
    x = tf.keras.layers.Activation('relu')(x)
    return x

def Conv( x , num_filters , kernel_size , strides=1 , alpha=1.0 ):
    # Standard convolution followed by BatchNorm and ReLU.
    x = tf.keras.layers.Conv2D( int( num_filters * alpha ) , kernel_size=kernel_size , strides=strides , use_bias=False , padding='same' )( x )
    x = tf.keras.layers.BatchNormalization( momentum=0.9997 )(x)
    x = tf.keras.layers.Activation('relu')(x)
    return x

# The number of classes is three.
num_classes = 3

# The shape of the input image.
inputs = tf.keras.layers.Input( shape=( 500 , 500 , 3 ) )

x = Conv( inputs , num_filters=32 , kernel_size=3 , strides=2 )
x = SeparableConv( x , num_filters=32 , strides=1 )
x = Conv( x , num_filters=64 , kernel_size=1 )
x = SeparableConv( x , num_filters=64 , strides=2  )
x = Conv( x , num_filters=128 , kernel_size=1 )
x = SeparableConv( x , num_filters=128 , strides=1  )
x = Conv( x , num_filters=128 , kernel_size=1 )
x = SeparableConv( x , num_filters=128 , strides=2  )
x = Conv( x , num_filters=256 , kernel_size=1 )
x = SeparableConv( x , num_filters=256 , strides=1  )
x = Conv( x , num_filters=256 , kernel_size=1 )
x = SeparableConv( x , num_filters=256 , strides=2  )
x = Conv( x , num_filters=512 , kernel_size=1 )

# You may uncomment the code below if your machine can tolerate such heavy computation!
#for i in range( 5 ):
    #x = SeparableConv(x, num_filters=512 , strides=1 )
    #x = Conv(x, num_filters=512 , kernel_size=1 )

x = SeparableConv(x, num_filters=512 , strides=2 )
x = Conv(x, num_filters=1024 , kernel_size=1 )
x = tf.keras.layers.AveragePooling2D( pool_size=( 7 , 7 ) )( x )
x = tf.keras.layers.Flatten()( x )
x = tf.keras.layers.Dense( num_classes )( x )
outputs = tf.keras.layers.Activation( 'softmax' )( x )

model = tf.keras.models.Model( inputs , outputs )

# As we're doing classification, we'll use categorical crossentropy and the Adam optimizer.
model.compile( loss='categorical_crossentropy' , optimizer=tf.keras.optimizers.Adam( learning_rate=0.001 ) , metrics=[ 'acc' ] )



# 3) Training the model

We'll train the model and see how well it works!



In [None]:

model.fit( train_X , train_Y , epochs=25 , batch_size=25 , validation_data=( test_X , test_Y ) )


Finally, evaluate the model on the test split we made earlier.

In [None]:

model.evaluate( test_X , test_Y )
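
With the loss and metric compiled above, `model.evaluate` reports the categorical crossentropy loss and the accuracy. As a rough illustration of what those two numbers mean, here is a NumPy sketch on made-up predictions. The helper functions and the sample arrays are invented for this example and only approximate what Keras computes internally.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -sum(y_true * log(y_pred)) over the samples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class matches the label."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))

# One-hot labels and softmax-like predictions for three samples.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=np.float64)
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3]])

loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(loss, acc)
```

The third sample is misclassified, so the accuracy is 2/3 and that sample contributes the largest term to the loss.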
