In [None]:
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dropout, Flatten, BatchNormalization, SeparableConv2D
import tensorflow as tf
from keras.datasets import fashion_mnist
from keras.utils import np_utils
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Xception

## A discussion on separable convolution and variants

* The spatial separable convolution is so named because it deals primarily with the spatial dimensions of an image and kernel: the width and the height.

* A spatial separable convolution simply divides a kernel into two smaller kernels. The most common case is to divide a 3x3 kernel into a 3x1 and a 1x3 kernel, like so:

![](https://miro.medium.com/max/1400/1*mL53fW0tJpNWEePp54y1Sg.png)

* Now, instead of doing one convolution with 9 multiplications per output value, we do two convolutions with 3 multiplications each (6 in total) to achieve the same effect. With fewer multiplications, computational complexity goes down, and the network is able to run faster.

![](https://miro.medium.com/max/1400/1*o3mKhG3nHS-1dWa_plCeFw.png)

* The main issue with the spatial separable convolution is that not all kernels can be “separated” into two, smaller kernels.
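
As a concrete check of the factorisation described above, here is a minimal NumPy sketch. It assumes the Sobel kernel as the separable example, and `conv2d_valid` is a hypothetical helper written only for this illustration (it is not a library function).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation (what deep learning calls convolution)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# The 3x3 Sobel kernel factors into a 3x1 column and a 1x3 row.
col = np.array([[1.0], [2.0], [1.0]])   # 3x1
row = np.array([[-1.0, 0.0, 1.0]])      # 1x3
sobel = col @ row                       # outer product gives the 3x3 kernel

rng = np.random.default_rng(0)
image = rng.random((6, 6))

# One pass with the full 3x3 kernel: 9 multiplications per output value.
full = conv2d_valid(image, sobel)
# Two passes with the small kernels: 3 multiplications per value in each pass.
separated = conv2d_valid(conv2d_valid(image, col), row)

print(np.allclose(full, separated))  # True
```

A kernel can be split this way only when it is an outer product of a column and a row (a rank-1 matrix), which is why most learned kernels cannot be separated.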

## Depthwise separable conv

* Unlike spatial separable convolutions, **depthwise separable convolutions** work with kernels that cannot be “factored” into two smaller kernels. Hence, they are more commonly used.

* The depthwise separable convolution is so named because it deals not just with the spatial dimensions, but also with the depth dimension (the number of channels).

* An input image may have 3 channels: RGB. After a few convolutions, an image may have multiple channels. You can imagine each channel as a particular interpretation of that image; for example, the “red” channel interprets the “redness” of each pixel, the “blue” channel interprets the “blueness” of each pixel, and the “green” channel interprets the “greenness” of each pixel. An image with 64 channels has 64 different interpretations of that image.

* Let's consider a normal 2D convolution, and let us assume we only want to create one filter. Below, we end up doing $5\times5\times3 = 75$ multiplications when we calculate ***one*** value in the feature map.

![](https://miro.medium.com/max/1400/1*fgYepSWdgywsqorf3bdksg.png)

* After going through a 5x5x3 kernel, the 12x12x3 image will become an 8x8x1 image.

* The above example was for one filter. How about more? What if we want to increase the number of channels in our output image? What if we want an output of size 8x8x256? See below.

![](https://miro.medium.com/max/1400/1*XloAmCh5bwE4j1G7yk5THw.png)

* Assuming we now want to create a feature map with a depth of 256, how many multiplications will we perform now? Each value still costs $5\times5\times3 = 75$ multiplications, but we now compute 256 values at every spatial position, i.e. $75\times256 = 19{,}200$ multiplications per position.

* This is where the depthwise separable convolution comes in.

* It is made up of two main operations: a depthwise convolution and a pointwise convolution.

* Step 1: The depthwise convolution performs a channel-wise $n\times n$ spatial convolution. In the image below, this means three 5x5x1 convolutions, one per channel. Each 5x5x1 kernel slides over 1 channel of the image (note: 1 channel, not all channels), taking the scalar product of every 25-pixel group (25 multiplications per output value) and giving an 8x8x1 image. Stacking these images together creates an 8x8x3 image.

![](https://miro.medium.com/max/1400/1*yG6z6ESzsRW-9q5F_neOsg.png)



* Step 2: The pointwise convolution (aka 1x1 conv). Remember, the original convolution transformed a 12x12x3 image into an 8x8x256 image. So far, the depthwise convolution has transformed the 12x12x3 image into an 8x8x3 image. Now we need to increase the number of channels of the image.

* We slide a 1x1x3 kernel over our 8x8x3 image to get an 8x8x1 image.

![](https://miro.medium.com/max/1400/1*37sVdBZZ9VK50pcAklh8AQ.png)

* Since we want 256 feature maps, we repeat this process 256 times. We create 256 1x1x3 kernels that each output an 8x8x1 image, giving a final image of shape 8x8x256.

![](https://miro.medium.com/max/1400/1*Q7a20gyuunpJzXGnWayUDQ.png) 
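
The two steps above can be traced with a small NumPy sketch. The helpers `depthwise_conv` and `pointwise_conv` are hypothetical names written for this illustration, the weights are random, and stride 1 with no padding is assumed, as in the figures.

```python
import numpy as np

def depthwise_conv(image, kernels):
    """Apply one k x k kernel per channel ('valid' padding, stride 1).

    image:   (H, W, C)
    kernels: (k, k, C), where kernel c is applied only to channel c
    """
    h, w, c = image.shape
    k = kernels.shape[0]
    out = np.zeros((h - k + 1, w - k + 1, c))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            patch = image[i:i + k, j:j + k, :]               # (k, k, C)
            out[i, j, :] = np.sum(patch * kernels, axis=(0, 1))
    return out

def pointwise_conv(image, weights):
    """1x1 convolution: mix the channels at every pixel.

    image:   (H, W, C_in)
    weights: (C_in, C_out), i.e. one 1x1xC_in kernel per output channel
    """
    return image @ weights

rng = np.random.default_rng(0)
image = rng.random((12, 12, 3))
depthwise_kernels = rng.random((5, 5, 3))   # three 5x5x1 kernels
pointwise_weights = rng.random((3, 256))    # 256 kernels of shape 1x1x3

step1 = depthwise_conv(image, depthwise_kernels)
step2 = pointwise_conv(step1, pointwise_weights)

print(step1.shape)  # (8, 8, 3)
print(step2.shape)  # (8, 8, 256)
```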

* Let’s calculate the number of multiplications the computer has to do in the original convolution. There are 256 5x5x3 kernels that move 8x8 times. Each kernel needs 3x5x5x8x8 = 4,800 multiplications, so for 256 kernels that’s 256x3x5x5x8x8 = 1,228,800 multiplications.

* What about the separable convolution? 

* In the depthwise convolution, we have 3 5x5x1 kernels that move 8x8 times. That’s 3x5x5x8x8 = 4,800 multiplications. In the pointwise convolution, we have 256 1x1x3 kernels that move 8x8 times. That’s 256x1x1x3x8x8 = 49,152 multiplications. Adding them together gives 53,952 multiplications.

* 53,952 is a lot less than 1,228,800. With fewer computations, the network is able to process more in a shorter amount of time.
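
The arithmetic above can be reproduced in a few lines of Python. The function names are hypothetical, chosen for this sketch, and only multiplications are counted (additions and biases are ignored), as in the text.

```python
def standard_conv_mults(out_h, out_w, k, c_in, c_out):
    # Every output value of every filter needs k * k * c_in multiplications.
    return out_h * out_w * k * k * c_in * c_out

def depthwise_separable_mults(out_h, out_w, k, c_in, c_out):
    # Depthwise step: one k x k kernel per input channel.
    depthwise = out_h * out_w * k * k * c_in
    # Pointwise step: c_out kernels of shape 1 x 1 x c_in.
    pointwise = out_h * out_w * c_in * c_out
    return depthwise + pointwise

standard = standard_conv_mults(8, 8, 5, 3, 256)
separable = depthwise_separable_mults(8, 8, 5, 3, 256)

print(standard)                         # 1228800
print(separable)                        # 53952
print(round(standard / separable, 1))   # 22.8
```

For this example the separable version needs roughly 23 times fewer multiplications.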

## Xception (modified depthwise separable convolution)

This is what we talked about, where we first apply depthwise and then pointwise convolution:

![](https://miro.medium.com/max/1400/1*VvBTMkVRus6bWOqrK1SlLQ.png)

* The modified depthwise separable convolution is a pointwise convolution followed by a depthwise convolution. This modification is motivated by the Inception module in Inception-v3, where the 1×1 convolution is done first, before any n×n spatial convolutions. See below.

![](https://miro.medium.com/max/1400/1*J8dborzVBRBupJfvR7YhuA.png)



* In the original Inception Module, there is non-linearity after the first operation. In Xception, the modified depthwise separable convolution, there is NO intermediate ReLU non-linearity.
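
To make the two orderings concrete, here is a minimal Keras sketch. The input size, the 16 filters and the 3x3 kernel are arbitrary values chosen for illustration, and the names `demo_inputs` and `demo` are hypothetical. It only shows how the layers are chained; it is not the Xception architecture itself.

```python
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Conv2D, DepthwiseConv2D

demo_inputs = Input(shape=(12, 12, 3))

# Original order: depthwise (spatial) convolution first, then pointwise (1x1).
x = DepthwiseConv2D(kernel_size=3, padding="same", use_bias=False)(demo_inputs)
original = Conv2D(filters=16, kernel_size=1, use_bias=False)(x)

# Modified order: pointwise (1x1) convolution first, then depthwise,
# with no activation (no ReLU) between the two operations.
y = Conv2D(filters=16, kernel_size=1, use_bias=False)(demo_inputs)
modified = DepthwiseConv2D(kernel_size=3, padding="same", use_bias=False)(y)

demo = Model(demo_inputs, [original, modified])
demo.summary()
```

Both branches produce a 12x12x16 output; they differ only in the order of the two operations.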


credits: 
 
 https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728

 https://towardsdatascience.com/review-xception-with-depthwise-separable-convolution-better-than-inception-v3-image-dc967dd42568

# Task: implement Xception

To simplify the problem, use an input shape of (94,94,1) when you train the model. When you build Xception, however, use the original spatial size from the paper, that is, an input of (299,299,1). (The paper uses 3 colour channels; here there is 1 because Fashion-MNIST is grayscale.)

The network is essentially broken up into 3 parts: entry, middle and exit. Code each one individually using the Functional API.

The shapes for each major block are provided to guide you.

Some notes to guide you:

![](https://drive.google.com/uc?export=view&id=1mwoViyi9FLhDgUBsxSsf6mmDC-Go7MXv)

In [None]:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import SeparableConv2D

In [None]:
# Define some input: a batch of 4 random tensors of shape 8x8x3
X = tf.random.uniform((4, 8, 8, 3))

In [None]:
sep = SeparableConv2D(filters = 2, kernel_size = 2)

In [None]:
sep(X).shape
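
For reference, the output shape and weight count of this small `SeparableConv2D` layer can be worked out by hand. The sketch below assumes the Keras defaults (`padding='valid'`, stride 1, `depth_multiplier=1`, `use_bias=True`); `valid_output_size` is a hypothetical helper.

```python
def valid_output_size(size, kernel_size, stride=1):
    # Output size of a convolution with no padding.
    return (size - kernel_size) // stride + 1

batch, height, width, channels = 4, 8, 8, 3
filters, kernel_size = 2, 2

out_shape = (batch,
             valid_output_size(height, kernel_size),
             valid_output_size(width, kernel_size),
             filters)

depthwise_weights = kernel_size * kernel_size * channels  # one 2x2 kernel per input channel
pointwise_weights = channels * filters                    # `filters` kernels of shape 1x1x3
bias_weights = filters                                    # one bias per output channel
total_weights = depthwise_weights + pointwise_weights + bias_weights

print(out_shape)      # (4, 7, 7, 2)
print(total_weights)  # 20
```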

## Entry flow

In [None]:
# Use an input of 299,299,1

# to simplify, call the small blocks as follows: stem, block2, block3, block4
# Shapes for each are as follows

#(None, 150, 150, 64)
#(None, 75, 75, 128)
#(None, 38, 38, 256)
#(None, 19, 19, 728)
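
The spatial sizes listed above are consistent with each of the four blocks halving the size once, using a stride of 2 with `padding='same'`. That is an assumption inferred from the numbers, not something the notes state. Under that assumption the sizes can be checked as follows (`same_padding_output_size` is a hypothetical helper):

```python
import math

def same_padding_output_size(size, stride=2):
    # With 'same' padding, the output size is ceil(input size / stride).
    return math.ceil(size / stride)

size = 299
entry_sizes = []
for _ in range(4):  # stem, block2, block3, block4
    size = same_padding_output_size(size)
    entry_sizes.append(size)

print(entry_sizes)  # [150, 75, 38, 19]
```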

## Middle flow

In [None]:
# to do
# for simplicity, only repeat 3 times
# you can call them, middle1, middle2 and middle3

# shapes as follows:

#(None, 19, 19, 728)
#(None, 19, 19, 728)
#(None, 19, 19, 728)

## Exit flow

In [None]:
# to do

# (None, 10, 10, 1024) shape after the addition operation
# (None, 10, 10, 2048) after last relu
# (None, 2048) # after global pooling

In [None]:
model = Model(inputs, output)

## Load the dataset

In [None]:
# load data
(X_train, Y_train), (X_test, Y_test) = tf.keras.datasets.fashion_mnist.load_data()

## Find the unique numbers from the train labels

In [None]:
classes = np.unique(Y_train)
nClasses = len(classes)
print('Total number of outputs : ', nClasses)
print('Output classes : ', classes)

## Reshape needed

Keras needs to know the depth (number of channels) of an image.

For CNNs, Keras expects the data in the following format: [batch, height, width, depth].

In this case the colour channel/depth of the images is 1. Currently the shape of `X_train` is (60000, 28, 28).

This doesn't have a depth value, so we reshape it to add one.

In [None]:
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], X_train.shape[2], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], X_test.shape[2], 1))

## Convert from categorical labels to one-hot encoded vectors

In this case there are 10 classes so we can tell the function to convert into a vector of length 10

In [None]:
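num_classes = 10
# tf.keras.utils.to_categorical replaces np_utils.to_categorical, which newer Keras versions no longer provide
Y_train = tf.keras.utils.to_categorical(Y_train, num_classes)
Y_test = tf.keras.utils.to_categorical(Y_test, num_classes)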

## Small twist!

API: https://www.tensorflow.org/api_docs/python/tf/data/Dataset

In [None]:
train_ds = tf.data.Dataset.from_tensor_slices((X_train, Y_train))
test_ds = tf.data.Dataset.from_tensor_slices((X_test, Y_test))

In [None]:
def resize_images(image, label):
    # Normalize images to have a mean of 0 and standard deviation of 1
    image = tf.image.per_image_standardization(image)

    image = tf.image.resize(image, (94,94))
    return image, label

In [None]:
train_ds = (train_ds
                  .map(resize_images)
                  .shuffle(buffer_size=10000)
                  .batch(batch_size=64, drop_remainder=True))
test_ds = (test_ds
                  .map(resize_images)
                  .batch(batch_size=32, drop_remainder=False))

In [None]:
model.compile(loss='categorical_crossentropy',
             optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
             metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
tf.keras.utils.plot_model(model, "Xception.png", show_shapes=True)

## Begin training

In [None]:
# train_ds is already batched, so batch_size must not be passed to fit()
model.fit(train_ds, epochs=2, verbose=1)

## Predict on all the test data

In [None]:
predictions = model.predict(test_ds)

In [None]:
predictions.shape

In [None]:
correct_values = np.argmax(Y_test,axis=-1)
predicted_classes = np.argmax(predictions,axis=-1)

In [None]:
# scikit-learn expects (y_true, y_pred); the result is the accuracy in percent
accuracy_score(correct_values, predicted_classes) * 100