## Part III: Separable convolutions ([Mobilenet](https://arxiv.org/pdf/1704.04861.pdf))
- Modify the baseline model from Part I to use *separable* convolutions, similar to Mobilenet.
- Check the number of parameters and compare it with the baseline.
- Train the classifier.


### Separable convolutions

When the filter is separable, a 2D convolution can be written as a sequence of two 1D convolutions, i.e. \begin{equation} y[m,n]=x[m,n]*h[m,n] = h_1[m]*(h_2[n]*x[m,n])\end{equation}

where $x$ is a 2D input signal, $h[m,n] = h_1[m]\,h_2[n]$ is a 2D filter that separates into two 1D filters $h_1$ and $h_2$, and $y$ is the output of convolving $x$ with $h$.

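As a quick numerical check of the identity above, the following minimal sketch convolves a small input with a separable 3x3 filter, once directly and once as two 1D passes, and compares the results. It is plain Python; the helper `conv2d_full` and all example values are illustrative assumptions, not part of the practical.

```python
def conv2d_full(x, h):
    """Full 2D convolution of two matrices given as nested lists."""
    x_rows, x_cols = len(x), len(x[0])
    h_rows, h_cols = len(h), len(h[0])
    y = [[0] * (x_cols + h_cols - 1) for _ in range(x_rows + h_rows - 1)]
    for i in range(x_rows):
        for j in range(x_cols):
            for a in range(h_rows):
                for b in range(h_cols):
                    y[i + a][j + b] += x[i][j] * h[a][b]
    return y


# A separable 3x3 filter: h[m][n] = h1[m] * h2[n].
h1 = [1, 2, 1]     # acts along m (rows)
h2 = [1, 0, -1]    # acts along n (columns)
h = [[a * b for b in h2] for a in h1]

x = [[3, 1, 4, 1],
     [5, 9, 2, 6],
     [5, 3, 5, 8],
     [9, 7, 9, 3]]

# Direct 2D convolution with the full 3x3 filter.
direct = conv2d_full(x, h)

# Two 1D passes: h2 as a 1x3 filter along n, then h1 as a 3x1 filter along m.
along_n = conv2d_full(x, [h2])
two_pass = conv2d_full(along_n, [[a] for a in h1])

print(direct == two_pass)
```

Ignoring borders, the direct version needs roughly 9 multiplications per output sample and the two-pass version roughly 3 + 3 = 6; the gap widens as the kernel grows.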

Similarly, in the 3D case we separate the feature-channel dimension from the spatial dimensions (as shown on the left of the figure below), i.e. \begin{equation} y[m,n,p]=x[m,n,p]*h[m,n,p] = h_1[p]*(h_2[m,n]*x[m,n,p])\end{equation}

![Separable convolutions](https://github.com/tmlss2018/PracticalSessions/blob/master/assets/separable.png?raw=true)

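To see why separating channels from spatial dimensions saves parameters, one can compare the weight count of a standard convolution with that of a depthwise convolution followed by a 1x1 convolution. The sketch below ignores biases; the function names and the layer sizes (3x3 kernel, 128 input and 256 output channels) are illustrative assumptions.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a k x k convolution mapping c_in to c_out channels."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out, channel_multiplier=1):
    """Weights of a k x k depthwise conv followed by a 1x1 conv."""
    depthwise = k * k * c_in * channel_multiplier
    pointwise = c_in * channel_multiplier * c_out
    return depthwise + pointwise


k, c_in, c_out = 3, 128, 256
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)
print("standard: ", std)
print("separable:", sep)
print("ratio:     {:.4f}".format(sep / std))
```

With `channel_multiplier = 1` the ratio simplifies to `1 / c_out + 1 / k**2`, so for 3x3 kernels and wide layers the separable version uses a little over one ninth of the weights.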
### Architecture
- baseline from Part I, adapted for separable convolutions; note that this is not exactly the model in the Mobilenet paper.
- first conv layer stays the same, kernel = (3,3), stride=1, num_channels = 64
- depthwise conv layers:


>* channel_multiplier = 1
>* strides = [1, 2, 1, 1, 2, 1, 1, 2, 1, 1]

- 1x1 conv layers:

>* num_channels = [128, 128, 128, 256, 256, 256, 512, 512, 512, 512]
>* strides = 1

- the depthwise and 1x1 conv layers are added after the first conv layer, alternating between the two types 
- padding: SAME (snt.SAME) for all layers
- num_output_classes = 10
- BatchNorm and ReLU are added after each conv layer as illustrated in the figure above

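The specification above is enough to estimate the spatial resolution after each block and the number of trainable parameters by hand. The sketch below is a rough count under stated assumptions: 3 input channels, a 24x24 input crop, a bias in every conv and linear layer, SAME padding (output size `ceil(size / stride)`), and one trainable offset per channel for each BatchNorm layer (assumed to match the Sonnet 1.x default of `offset=True, scale=False`). All variable and function names are hypothetical.

```python
first_conv_channels = 64
pointwise_channels = [128, 128, 128, 256, 256, 256, 512, 512, 512, 512]
depthwise_strides = [1, 2, 1, 1, 2, 1, 1, 2, 1, 1]
num_classes = 10
input_channels = 3          # assumption: RGB input
input_size = 24             # assumption: 24x24 training crop
bn_params_per_channel = 1   # assumption: BatchNorm learns an offset only


def conv_params(k, c_in, c_out):
    """Weights plus biases of a k x k convolution."""
    return k * k * c_in * c_out + c_out


def depthwise_params(k, channels):
    """Weights plus biases of a k x k depthwise conv (channel_multiplier=1)."""
    return k * k * channels + channels


# First standard convolution (stride 1) and its batch norm.
total = conv_params(3, input_channels, first_conv_channels)
total += bn_params_per_channel * first_conv_channels

size = input_size
sizes = []
c_in = first_conv_channels
for stride, c_out in zip(depthwise_strides, pointwise_channels):
    # Depthwise 3x3 conv + batch norm.
    total += depthwise_params(3, c_in) + bn_params_per_channel * c_in
    # 1x1 conv + batch norm.
    total += conv_params(1, c_in, c_out) + bn_params_per_channel * c_out
    # SAME padding: spatial size shrinks to ceil(size / stride).
    size = -(-size // stride)
    sizes.append(size)
    c_in = c_out

# Final linear classifier applied after global average pooling.
total += c_in * num_classes + num_classes

print("spatial size after each block:", sizes)
print("estimated trainable parameters:", total)
```

If the count reported by the graph differs, the BatchNorm assumption is the first thing to check, since enabling a learned scale adds one more parameter per channel.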
### Imports

In [0]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import math
import time

import tensorflow as tf

# Don't forget to select GPU runtime environment in Runtime -> Change runtime type
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

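# We will use Sonnet on top of TF. Note that this notebook relies on the
# Sonnet 1.x and TensorFlow 1.x APIs (e.g. snt.AbstractModule, tf.Session).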
!pip install -q dm-sonnet
import sonnet as snt

import numpy as np

# Plotting library.
from matplotlib import pyplot as plt
import pylab as pl
from IPython import display

In [0]:
# Reset graph
tf.reset_default_graph()

### Download dataset to be used for training and testing
- CIFAR-10: the equivalent of MNIST for natural RGB images
- 60000 32x32 colour images in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
- train: 50000; test: 10000

In [0]:
cifar10 = tf.keras.datasets.cifar10
# (down)load dataset
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()

In [0]:
#@title Prepare the data for training and testing
# define dimension of the batches to sample from the datasets
BATCH_SIZE_TRAIN = 64 #@param
BATCH_SIZE_TEST = 100 #@param

# create Dataset objects using the data previously downloaded
dataset_train = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
# we shuffle the data and repeatedly sample batches for training
batched_dataset_train = dataset_train.shuffle(100000).repeat().batch(BATCH_SIZE_TRAIN)
# create iterator to retrieve batches
iterator_train = batched_dataset_train.make_one_shot_iterator()
# get a training batch of images and labels
(batch_train_images, batch_train_labels) = iterator_train.get_next()

# we do the same for test dataset
dataset_test = tf.data.Dataset.from_tensor_slices((test_images, test_labels))
batched_dataset_test = dataset_test.repeat().batch(BATCH_SIZE_TEST)
iterator_test = batched_dataset_test.make_one_shot_iterator() 
(batch_test_images, batch_test_labels) = iterator_test.get_next()

# Squeeze labels and convert from uint8 to int32 - required below by the loss op
batch_test_labels = tf.cast(tf.squeeze(batch_test_labels), tf.int32)
batch_train_labels = tf.cast(tf.squeeze(batch_train_labels), tf.int32)

In [0]:
#@title Preprocessing of data
# Data augmentation used for train preprocessing
# - scale image to [-1 , 1]
# - get a random crop
# - apply horizontal flip randomly

def train_image_preprocess(h, w, random_flip=True):
  """Image processing required for training the model."""
  
  def random_flip_left_right(image, flip_index, seed=None):
    shape = image.get_shape()
    if shape.ndims == 3 or shape.ndims is None:
      uniform_random = tf.random_uniform([], 0, 1.0, seed=seed)
      mirror_cond = tf.less(uniform_random, .5)
      result = tf.cond(
          mirror_cond,
          lambda: tf.reverse(image, [flip_index]),
          lambda: image
      )
      return result
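    elif shape.ndims == 4:
      uniform_random = tf.random_uniform(
          [tf.shape(image)[0]], 0, 1.0, seed=seed
      )
      mirror_cond = tf.less(uniform_random, .5)
      # flip_index is an axis of the 4D batch (2 = width), so reverse the whole
      # batch along that axis and pick the flipped copy per example. Reversing
      # inside tf.map_fn would act on 3D images, where axis 2 is the channels.
      return tf.where(
          mirror_cond,
          tf.reverse(image, [flip_index]),
          image
      )
    else:
      raise ValueError("'image' must have either 3 or 4 dimensions.")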

  def fn(image):
    # Ensure the data is in range [-1, 1].
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    image = image * 2.0 - 1.0
    # Randomly choose a (h, w, 3) patch; tf.random_crop draws a single offset,
    # so all images in the batch share the same crop window.
    image = tf.random_crop(image, size=(BATCH_SIZE_TRAIN, h, w, 3))
    # Randomly flip the image.
    image = random_flip_left_right(image, 2)
    return image

  return fn

# Test preprocessing: only scale to [-1,1].
def test_image_preprocess():
  def fn(image):
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    image = image * 2.0 - 1.0
    return image
  return fn

### Define classifier using separable convolutions

In [0]:
class Mobilenet(snt.AbstractModule):
  
  def __init__(self, num_classes, name="mobilenet"):
    super(Mobilenet, self).__init__(name=name)
    self._num_classes = num_classes
    self._output_channels_first_conv = 64
    self._output_channels_1x1 = [
        128, 128, 128, 256, 256, 256, 512, 512, 512, 512
    ]
    self._strides_dw = [1, 2, 1, 1, 2, 1, 1, 2, 1, 1]
    self._num_layers_dw = len(self._strides_dw)
    self._num_layers_1x1 = len(self._output_channels_1x1)
   
  def _build(self, inputs, is_training=None, test_local_stats=False):
    net = inputs
    # instantiate all the convolutional layers
    first_conv = snt.Conv2D(name="conv_2d_0",
                            output_channels=self._output_channels_first_conv,
                            kernel_shape=3,
                            stride=1,
                            padding=snt.SAME,
                            use_bias=True)
    
    # instantiate depthwise conv layers
    conv_layers_dw = [snt.DepthwiseConv2D(name="conv_dw_2d_{}".format(i),
                                          channel_multiplier=1,
                                          kernel_shape=3,
                                          stride=self._strides_dw[i],
                                          padding=snt.SAME,
                                          use_bias=True)
                      for i in range(self._num_layers_dw)]
    
    # instantiate 1x1 conv layers
    conv_layers_1x1 = [snt.Conv2D(name="conv_1x1_2d_{}".format(i),
                                  output_channels=self._output_channels_1x1[i],
                                  kernel_shape=1,
                                  stride=1,
                                  padding=snt.SAME,
                                  use_bias=True)
                       for i in range(self._num_layers_1x1)]
    # connect first layer to the graph, adding batch norm and non-linearity
    net = first_conv(net)
    bn = snt.BatchNorm(name="batch_norm_0")
    net = bn(net, is_training=is_training, test_local_stats=test_local_stats)
    net = tf.nn.relu(net)
    
    # connect the rest of the layers
    for i, (layer_dw, layer_1x1) in enumerate(zip(conv_layers_dw, conv_layers_1x1)):
      net = layer_dw(net)
      bn = snt.BatchNorm(name="batch_norm_{}_0".format(i))
      net = bn(net, is_training=is_training, test_local_stats=test_local_stats)
      net = tf.nn.relu(net)
      net = layer_1x1(net)
      bn = snt.BatchNorm(name="batch_norm_{}_1".format(i))
      net = bn(net, is_training=is_training, test_local_stats=test_local_stats)
      net = tf.nn.relu(net)      

    # global average pooling over the spatial dimensions (height, width)
    net = tf.reduce_mean(net, axis=[1, 2], keepdims=False,
                         name="avg_pool")

    logits = snt.Linear(self._num_classes)(net)

    return logits

In [0]:
#@title Get number of parameters in a scope by iterating through the trainable variables
def get_num_params(scope):
  total_parameters = 0
  for variable in tf.trainable_variables(scope):
    # shape is an array of tf.Dimension
    shape = variable.get_shape()
    variable_parameters = 1
    for dim in shape:
      variable_parameters *= dim.value
    total_parameters += variable_parameters
  return total_parameters

### Instantiate the model and connect to data 


In [0]:
# First define the preprocessing ops for the train/test data
crop_height = 24 #@param
crop_width = 24 #@param
preprocess_fn_train = train_image_preprocess(crop_height, crop_width)
preprocess_fn_test = test_image_preprocess()

num_classes = 10 #@param

In [0]:
# Instantiate the model
with tf.variable_scope("mobilenet"):
  mobilenet_model = Mobilenet(num_classes=num_classes)

In [0]:
# Get predictions from the model; use the corresponding preprocess ops and is_training flag
predictions_mobilenet = mobilenet_model(preprocess_fn_train(batch_train_images), is_training=True)
print (predictions_mobilenet)

test_predictions_mobilenet = mobilenet_model(preprocess_fn_test(batch_test_images), is_training=False)
print (test_predictions_mobilenet)

### Get number of parameters and compare with baseline

In [0]:
# Can you obtain this number by hand?
print ("Total number of parameters of Mobilenet model")
print (get_num_params("mobilenet"))

In [0]:
# @title Setup training (same as for baseline)
def get_loss(logits=None, labels=None):
  # We reduce over batch dimension, to ensure the loss is a scalar.   
  return tf.reduce_mean(
      tf.nn.sparse_softmax_cross_entropy_with_logits(
          labels=labels, logits=logits))

# Define train and test loss functions
train_loss = get_loss(labels=batch_train_labels, logits=predictions_mobilenet)
test_loss = get_loss(labels=batch_test_labels, logits=test_predictions_mobilenet)

# For evaluation, we look at top_k_accuracy since it's easier to interpret; normally k=1 or k=5
def top_k_accuracy(k, labels, logits):
  in_top_k = tf.nn.in_top_k(predictions=tf.squeeze(logits), targets=labels, k=k)
  return tf.reduce_mean(tf.cast(in_top_k, tf.float32))

def get_optimizer(step):
  """Get the optimizer used for training."""
  lr_init = 0.1 # initial value for the learning rate
  lr_schedule = (40e3, 60e3, 80e3) # after how many iterations to reduce the learning rate
  lr_schedule = tf.to_int64(lr_schedule)
  lr_factor = 0.1 # reduce learning rate by this factor
  
  
  # Count how many schedule boundaries the current step has passed
  num_decays = tf.reduce_sum(tf.to_float(step >= lr_schedule))
  lr = lr_init * lr_factor**num_decays

  return tf.train.MomentumOptimizer(learning_rate=lr, momentum=0.9)

# Create a global step that is incremented during training; useful for e.g. learning rate annealing
global_step = tf.train.get_or_create_global_step()

# instantiate the optimizer
optimizer = get_optimizer(global_step)

# Get training ops
training_mobilenet_op = optimizer.minimize(train_loss, global_step)

# Retrieve the update ops, which contain the moving average ops
update_ops = tf.group(*tf.get_collection(tf.GraphKeys.UPDATE_OPS))

# Manually add the update ops to the dependency path executed at each training iteration
training_mobilenet_op = tf.group(training_mobilenet_op, update_ops)

# Get test ops
test_acc_mobilenet_op = top_k_accuracy(1, batch_test_labels, test_predictions_mobilenet)

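# Function that takes a list of losses and the matching steps and plots them.
def plot_losses(loss_list, steps):
  # Draw first, then refresh the cell output so the figure shown is up to date.
  pl.plot(steps, loss_list, c='b')
  display.clear_output(wait=True)
  display.display(pl.gcf())
  time.sleep(1.0)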

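The step-wise learning-rate decay built in `get_optimizer` can be reproduced in plain Python to check which rate is in effect at a given iteration. This is a minimal sketch assuming the same initial rate, boundaries and decay factor as above; the function name `learning_rate` is hypothetical.

```python
def learning_rate(step, lr_init=0.1, boundaries=(40000, 60000, 80000),
                  factor=0.1):
    """Piecewise-constant schedule: multiply by `factor` at each boundary."""
    # Number of boundaries already reached by this step.
    num_decays = sum(step >= b for b in boundaries)
    return lr_init * factor ** num_decays


for step in (0, 39999, 40000, 60000, 80000):
    print("step {:6d}: lr = {:.0e}".format(step, learning_rate(step)))
```

The rate stays at 0.1 up to iteration 39999 and is divided by 10 at each of the three boundaries, so the last 10000 iterations of a 90000-iteration run use a rate of about 1e-4.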
### Training params


In [0]:
# Define number of training iterations and reporting intervals
TRAIN_ITERS = 90e3 #@param
REPORT_TRAIN_EVERY = 10 #@param
PLOT_EVERY = 500 #@param
REPORT_TEST_EVERY = 1000 #@param
TEST_ITERS = 50 #@param

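To put these settings in perspective, the short calculation below converts them into epochs over the training set and into test-set coverage per evaluation. It is a sketch that assumes the batch sizes (64 for training, 100 for testing) and dataset sizes (50000 train, 10000 test) stated earlier, and Python 3 division; the variable names are illustrative.

```python
train_iters = 90000
batch_size_train = 64
num_train_images = 50000

test_iters = 50
batch_size_test = 100
num_test_images = 10000
report_test_every = 1000

# How many passes over the training set the full run corresponds to.
epochs = train_iters * batch_size_train / num_train_images

# How much of the test set a single evaluation covers.
images_per_eval = test_iters * batch_size_test
test_fraction = images_per_eval / num_test_images

# How many evaluations are run during training.
num_evals = train_iters // report_test_every

print("epochs:", epochs)
print("test images per evaluation:", images_per_eval)
print("fraction of test set per evaluation:", test_fraction)
print("number of evaluations:", num_evals)
```

Each evaluation therefore sees half of the test set. Because the test iterator is not shuffled and keeps advancing, consecutive evaluations should alternate between the two halves, which can make the reported accuracy fluctuate slightly from one report to the next.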
### Train the model. Full training gives ~87% accuracy.

In [0]:
# Create the session and initialize variables
sess = tf.Session()
sess.run(tf.global_variables_initializer())

train_iter = 0
losses = []
steps = []
for train_iter in range(int(TRAIN_ITERS)):
  _, train_loss_np = sess.run([training_mobilenet_op, train_loss])
  
  if (train_iter % REPORT_TRAIN_EVERY) == 0:
    losses.append(train_loss_np)
    steps.append(train_iter)
  if (train_iter % PLOT_EVERY) == 0:
    plot_losses(losses, steps)    
    
  if (train_iter % REPORT_TEST_EVERY) == 0:
    avg_acc = 0.0
    for test_iter in range(TEST_ITERS):
      acc = sess.run(test_acc_mobilenet_op)
      avg_acc += acc
      
    avg_acc /= (TEST_ITERS)
    print ('Test acc at iter {0:5d} out of {1:5d} is {2:.2f}%'.format(int(train_iter), int(TRAIN_ITERS), avg_acc*100.0))