In [2]:
# Checking for GPU availability on Colab
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [3]:
!python --version

Python 3.6.7


In [4]:
!rm -r id2223-lab2 MNIST_data test_images

rm: cannot remove 'id2223-lab2': No such file or directory
rm: cannot remove 'MNIST_data': No such file or directory
rm: cannot remove 'test_images': No such file or directory


In [0]:
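# __future__ imports have to come before any other statement
from __future__ import division, print_function, unicode_literals

import numpy as np
# import tensorflow as tf  (already imported in the GPU check cell above)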

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

In [6]:
!git clone https://github.com/ssheikholeslami/id2223-lab2.git

Cloning into 'id2223-lab2'...
remote: Enumerating objects: 34, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 34 (delta 2), reused 34 (delta 2), pack-reused 0
Unpacking objects: 100% (34/34), done.


In [0]:
!mv id2223-lab2/test_images ./

# Tensorflow and Deep Learning

In this lab assignment, first you will learn how to build and train a neural network that recognises handwritten digits, and then you will build LeNet-5 CNN architecture, which is widely used for handwritten digit recognition. At the end of this lab assignment, you will make AlexNet CNN architecture, which won the 2012 ImageNet ILSVRC challenge.

---
# 1. Dataset
In the first part of the assignment, we use the MNIST dataset, a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents and has 784 features: each image is 28×28=784 pixels, and each feature simply represents one pixel's intensity, from 0 (white) to 255 (black). The following figure shows a few images from the MNIST dataset to give you a feel for the complexity of the classification task.

<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/1-mnist.png" style="width: 300px;"/>

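To begin the assignment, first use `mnist_data.read_data_sets` to download the images and labels. It returns a dataset object with two parts that we use here: `mnist.test` with 10K images+labels, and `mnist.train` with the training images+labels (by default, `read_data_sets` holds back 5K of the 60K training images as `mnist.validation`, leaving 55K in `mnist.train`).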
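
The `one_hot=True` argument used in the next cell asks for each label as a vector of length 10 with a single 1 at the index of the digit. A minimal plain-Python sketch of that encoding follows; the helper name `one_hot` is made up for illustration and is not part of the TensorFlow API used in this lab.

```python
def one_hot(label, num_classes=10):
    """Return a list of zeros with a single 1.0 at index `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

# The digit 3 becomes a vector whose fourth entry (index 3) is 1.0.
encoded = one_hot(3)
print(encoded)
```

Taking the index of the largest entry recovers the digit, which is what `tf.argmax` does later when the accuracy is computed.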

In [8]:
# TODO: Replace <FILL IN> with appropriate code

from tensorflow.examples.tutorials.mnist import input_data as mnist_data

mnist = mnist_data.read_data_sets("MNIST_data/", one_hot= True)

Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use urllib or similar directly.
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Instructions for updating:
Please use tf.one_hot on tensors.
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py fr

---
# 2. A One-Layer Neural Network
<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/2-comic1.png" style="width: 500px;"/>

Let's start by building a one-layer neural network. Handwritten digits in the MNIST dataset are 28x28 pixel greyscale images. The simplest approach for classifying them is to use the 28x28=784 pixels as inputs for a **one-layer neural network**. Each neuron in the network does a weighted sum of all of its inputs, adds a bias and then feeds the result through some non-linear activation function. Here we design a one-layer neural network with 10 output neurons since we want to classify digits into 10 classes (0 to 9).

<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/3-one_layer.png" style="width: 400px;"/>


For a classification problem, an *activation function* that works well is **softmax**. Applying softmax on a vector is done by taking the exponential of each element and then normalising the vector.

<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/4-softmax.png" style="width: 300px;"/>
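
As a rough illustration of the softmax definition above, here is a small plain-Python sketch. The function name `softmax` and the sample scores are assumptions chosen for the example; subtracting the maximum before exponentiating is a common precaution against overflow and does not change the result.

```python
import math

def softmax(scores):
    """Exponentiate each score, then normalise so the results sum to 1."""
    # Shifting by the maximum leaves the output unchanged but avoids overflow.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

The largest score receives the largest probability, and the outputs always sum to 1, so they can be read as class probabilities.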

We can summarise the behaviour of this single layer of neurons in a simple formula using a *matrix multiply*. If we feed input data into the network in *mini-batches* of 100 images, it produces 100 predictions as the output. We define the **weights matrix $W$** with 10 columns, in which each column holds the weights of one class (a single digit), from 0 to 9. Using the first column of $W$, we can compute the weighted sum of all the pixels of the first image. This sum corresponds to the first neuron, which points to the number 0. Using the second column of $W$, we do the same for the second neuron (number 1) and so on until the 10th neuron. We can then repeat the operation for the remaining 99 images in the mini-batch. If we call $X$ the matrix containing our 100 images (each row corresponds to one image), all the weighted sums for our 10 neurons, computed on 100 images, are simply $X \cdot W$. Each neuron must now add its bias. Since we have 10 neurons, we have 10 bias constants. We finally apply the **softmax activation function** and obtain the formula describing a one-layer neural network, applied to 100 images.

<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/5-xw.png" style="width: 600px;"/>
<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/6-softmax2.png" style="width: 500px;"/>
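
To make the shapes concrete, the following sketch computes $X \cdot W + b$ with plain Python lists on toy sizes (2 "images" of 3 "pixels" and 2 classes, standing in for 100 images of 784 pixels and 10 classes). The helper name `dense` and all numbers are illustrative assumptions.

```python
def dense(X, W, b):
    """Compute X.W + b for X [batch, n_in], W [n_in, n_out], b [n_out]."""
    n_out = len(b)
    return [
        [sum(x_j * W[j][k] for j, x_j in enumerate(row)) + b[k] for k in range(n_out)]
        for row in X
    ]

X = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 0.0]]      # 2 images, 3 pixels each
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]           # 3 pixels -> 2 classes, one column per class
b = [0.5, -0.5]            # one bias per output neuron

Z = dense(X, W, b)         # shape [2, 2]: one row per image, one column per class
print(Z)
```

Each row of the result holds the weighted sums for one image, and the same bias vector is added to every row, which is what broadcasting does in `tf.matmul(XX, W) + b`.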

Then, we need to use the **cross-entropy** to measure how good the predictions are, i.e., the distance between what the network tells us and what we know to be the truth. The cross-entropy is a function of weights, biases, pixels of the training image and its known label. If we compute the partial derivatives of the cross-entropy with respect to all the weights and all the biases, we obtain a **gradient**, computed for a given image, label and present value of weights and biases. We can update weights and biases by a fraction of the gradient and do the same thing again using the next batch of training images.

<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/7-cross_entropy.png" style="width: 600px;"/>
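
The update described above can be sketched for a single toy example in plain Python. This is only an illustration under stated assumptions: the input has 3 "pixels" and 2 classes, the learning rate of 0.1 is arbitrary, and the helper names are made up. It relies on the standard result that, for softmax followed by cross-entropy, the derivative of the loss with respect to each weighted sum is $\hat{Y}_k - Y_k$.

```python
import math

def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, W, b):
    """Class probabilities for one input vector x."""
    logits = [sum(x[j] * W[j][k] for j in range(len(x))) + b[k] for k in range(len(b))]
    return softmax(logits)

def cross_entropy(y, y_hat):
    """-sum(y * log(y_hat)) for a one-hot label y."""
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

x = [1.0, 0.5, 0.0]            # toy "image" with 3 pixels
y = [1.0, 0.0]                 # one-hot label: class 0
W = [[0.0, 0.0] for _ in x]    # weights [3, 2], initialised to zero
b = [0.0, 0.0]
lr = 0.1                       # illustrative learning rate

y_hat = forward(x, W, b)
loss_before = cross_entropy(y, y_hat)

# Gradient with respect to each weighted sum: y_hat_k - y_k.
delta = [p - t for p, t in zip(y_hat, y)]
for j in range(len(x)):
    for k in range(len(b)):
        W[j][k] -= lr * x[j] * delta[k]   # d(loss)/dW[j][k] = x[j] * delta[k]
for k in range(len(b)):
    b[k] -= lr * delta[k]                 # d(loss)/db[k] = delta[k]

loss_after = cross_entropy(y, forward(x, W, b))
print(loss_before, loss_after)
```

With zero weights the two classes start out equally likely, so the initial loss is $log(2)$; after one small step in the direction of the negative gradient the loss on this example is lower.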

### Define Variables and Placeholders
First we define TensorFlow **variables** and **placeholders**. *Variables* are all the parameters that you want the training algorithm to determine for you (e.g., weights and biases). *Placeholders* are parameters that will be filled with actual data during training (e.g., training images). The shape of the tensor holding the training images is [None, 28, 28, 1] which stands for:
  - 28, 28, 1: our images are 28x28 (784) pixels x 1 value per pixel (grayscale). The last number would be 3 for color images and is not really necessary here.
  - None: this dimension will be the number of images in the mini-batch. It will be known at training time.

We also need an additional placeholder for the training labels that will be provided alongside training images.

In [0]:
# TODO: Replace <FILL IN> with appropriate code

# neural network with 1 layer of 10 softmax neurons
#
# · · · · · · · · · ·       (input data, flattened pixels)       X [batch, 784] 
# \x/x\x/x\x/x\x/x\x/    -- fully connected layer (softmax)      W [784, 10]     b[10]
#   · · · · · · · ·                                              Y_hat [batch, 10]

# input X: 28x28 grayscale images, the first dimension (None) will index the images in the mini-batch
X = tf.placeholder(tf.float32, [None, 28, 28, 1]) # None: number of images in the mini-batch, 28*28 pixels, 1 value per pixel

# correct answers will go here
Y = tf.placeholder(tf.float32, [None, 10])

# weights W[784, 10], 784 = 28 * 28
W = tf.Variable(tf.zeros([784, 10]))

# biases b[10]
b = tf.Variable(tf.zeros([10]))

### Build The Model
Now, we can make a **model** for a one-layer neural network. The formula is the one we explained before, i.e., $\hat{Y} = softmax(X \cdot W + b)$. You can use `tf.nn.softmax` and `tf.matmul` to build the model. Here, we need to use `tf.reshape` to transform our 28x28 images into single vectors of 784 pixels.

In [0]:

# flatten the images into a single line of pixels
XX = tf.reshape(X, [-1, 784])
# -1 means the number of rows would correspond to number of images per mini-batch
# and it will be figured out during execution



# The model
Y_hat = tf.nn.softmax(tf.matmul(XX, W) + b)

### Define The Cost Function
Now, we have model predictions $\hat{Y}$ and correct labels $Y$, so for each instance $i$ (image) we can compute the cross-entropy as the **cost function**: $cross\_entropy = -\sum(Y_i * log(\hat{Y}_i))$. You can use `tf.reduce_sum` to add all the components in a tensor.

In [0]:
cross_entropy = -tf.reduce_sum(Y * tf.log(Y_hat))

In [0]:
is_correct = tf.equal(tf.argmax(Y_hat,1), tf.argmax(Y,1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))

### Train the Model
Now, select the gradient descent optimiser `GradientDescentOptimizer` and ask it to minimise the cross-entropy cost. In this step, TensorFlow computes the partial derivatives of the cost function with respect to all the weights and all the biases (the gradient). The gradient is then used to update the weights and biases. Set the learning rate to $0.005$.

In [0]:
optimizer = tf.train.GradientDescentOptimizer(0.005)
train_step = optimizer.minimize(cross_entropy)

### Execute the Model
It is time to run the training loop. All the TensorFlow instructions up to this point have been preparing a computation graph in memory but nothing has been computed yet. The computation requires actual data to be fed into the placeholders. This is supplied in the form of a Python dictionary, where the keys are the placeholders. During training, print out the cost every 200 steps. Moreover, after training the model, print out the accuracy of the model by testing it on the test data.

In [14]:
# init
init = tf.global_variables_initializer()

n_epochs = 5000
with tf.Session() as sess:
  
  init.run()
  for i in range(n_epochs):
    batch_X, batch_Y = mnist.train.next_batch(100)
    
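    # next_batch returns images already flattened to [100, 784],
    # so the batch is fed to the reshaped tensor XX instead of X
    training_data = {XX: batch_X, Y: batch_Y}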
    sess.run([train_step], feed_dict=training_data)
    #sess.run(train_step, feed_dict=training_data)
    if i%200 == 0:
      cost = sess.run([cross_entropy], feed_dict=training_data)
      print("iteration {0}, cost is {1}".format(i, cost))
  
  test_data={XX: mnist.test.images, Y: mnist.test.labels}

  a = sess.run([accuracy], feed_dict=test_data)
  print("Accuracy on the test set is: {0}".format(a))

iteration 0, cost is [163.45651]
iteration 200, cost is [32.430332]
iteration 400, cost is [24.468834]
iteration 600, cost is [19.25543]
iteration 800, cost is [18.332882]
iteration 1000, cost is [34.8919]
iteration 1200, cost is [15.3029585]
iteration 1400, cost is [16.054102]
iteration 1600, cost is [17.40675]
iteration 1800, cost is [20.477411]
iteration 2000, cost is [27.12159]
iteration 2200, cost is [32.961174]
iteration 2400, cost is [24.82403]
iteration 2600, cost is [14.263334]
iteration 2800, cost is [29.216372]
iteration 3000, cost is [18.719868]
iteration 3200, cost is [16.40898]
iteration 3400, cost is [13.435759]
iteration 3600, cost is [26.640228]
iteration 3800, cost is [21.806126]
iteration 4000, cost is [18.824202]
iteration 4200, cost is [24.382624]
iteration 4400, cost is [19.329655]
iteration 4600, cost is [19.496109]
iteration 4800, cost is [32.17187]
Accuracy on the test set is: [0.9228]


---
# 3. Add More Layers

<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/8-comic2.png" style="width: 500px;"/>

Now, let's improve the recognition accuracy by adding more layers to the neural network. The neurons in the second layer, instead of computing weighted sums of pixels, will compute weighted sums of neuron outputs from the previous layer. We keep the softmax function as the activation function on the last layer, but on intermediate layers we will use the **sigmoid** activation function. So, let's build a five-layer fully connected neural network with the following structure, train the model with the training data and print out its accuracy on the test data.
<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/9-five_layer.png" style="width: 500px;"/>
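
To get a feel for how much larger this network is than the one-layer model, the sketch below counts trainable parameters from the layer sizes 784, 200, 100, 60, 30 and 10 used in the following cell. The helper name `count_parameters` is an assumption made for this example.

```python
layer_sizes = [784, 200, 100, 60, 30, 10]

def count_parameters(sizes):
    """Sum of weights (n_in * n_out) and biases (n_out) over consecutive layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

total = count_parameters(layer_sizes)
one_layer = count_parameters([784, 10])
print(total, one_layer)
```

Most of the parameters sit in the first layer, because it connects all 784 pixels to 200 neurons.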

In [15]:
# TODO: Replace <FILL IN> with appropriate code

# neural network with five layers
#
# · · · · · · · · · ·          (input data, flattened pixels)       X [batch, 784]   
# \x/x\x/x\x/x\x/x\x/       -- fully connected layer (sigmoid)      W1 [784, 200]      B1 [200]
#  · · · · · · · · ·                                                Y1_hat [batch, 200]
#   \x/x\x/x\x/x\x/         -- fully connected layer (sigmoid)      W2 [200, 100]      B2 [100]
#    · · · · · · ·                                                  Y2_hat [batch, 100]
#     \x/x\x/x\x/           -- fully connected layer (sigmoid)      W3 [100, 60]       B3 [60]
#      · · · · ·                                                    Y3_hat [batch, 60]
#       \x/x\x/             -- fully connected layer (sigmoid)      W4 [60, 30]        B4 [30]
#        · · ·                                                      Y4_hat [batch, 30]
#         \x/               -- fully connected layer (softmax)      W5 [30, 10]        B5 [10]
#          ·                                                        Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################

X = tf.placeholder(tf.float32, [None, 28, 28, 1]) # None: number of images in the mini-batch, 28*28 pixels, 1 value per pixel
Y = tf.placeholder(tf.float32, [None, 10])


# five layers and their number of neurons, i.e., 200, 100, 60, 30, and 10

# initializing weights with random values (tf.truncated_normal):
#  a normal (Gaussian) distribution truncated to the range -2*stddev to +2*stddev.
W1 = tf.Variable(tf.truncated_normal([784, 200], stddev=0.1))
B1 = tf.Variable(tf.zeros([200]))

W2 = tf.Variable(tf.truncated_normal([200, 100], stddev=0.1))
B2 = tf.Variable(tf.zeros([100]))

W3 = tf.Variable(tf.truncated_normal([100, 60], stddev=0.1))
B3 = tf.Variable(tf.zeros([60]))

W4 = tf.Variable(tf.truncated_normal([60, 30], stddev=0.1))
B4 = tf.Variable(tf.zeros([30]))

W5 = tf.Variable(tf.truncated_normal([30, 10], stddev=0.1))
B5 = tf.Variable(tf.zeros([10]))

########################################
# build the model
########################################
XX = tf.reshape(X, [-1, 784])

Y1_hat = tf.nn.sigmoid(tf.matmul(XX, W1) + B1)
Y2_hat = tf.nn.sigmoid(tf.matmul(Y1_hat, W2) + B2)
Y3_hat = tf.nn.sigmoid(tf.matmul(Y2_hat, W3) + B3)
Y4_hat = tf.nn.sigmoid(tf.matmul(Y3_hat, W4) + B4)
Y_hat = tf.nn.softmax(tf.matmul(Y4_hat, W5) + B5)

########################################
# define the cost function
########################################
cross_entropy = -tf.reduce_sum(Y * tf.log(Y_hat))

# for accuracy
is_correct = tf.equal(tf.argmax(Y_hat,1), tf.argmax(Y,1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))

########################################
# define the optimizer
########################################
optimizer = tf.train.GradientDescentOptimizer(0.005)
train_step = optimizer.minimize(cross_entropy)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()

n_epochs = 5000
with tf.Session() as sess:
  
  init.run()
  for i in range(n_epochs):
    batch_X, batch_Y = mnist.train.next_batch(100)
    
    training_data = {XX: batch_X, Y: batch_Y}
    sess.run([train_step], feed_dict=training_data)
    #sess.run(train_step, feed_dict=training_data)
    if i%200 == 0:
      cost = sess.run([cross_entropy], feed_dict=training_data)
      print("iteration {0}, cost is {1}".format(i, cost))
  
  test_data={XX: mnist.test.images, Y: mnist.test.labels}

  a = sess.run([accuracy], feed_dict=test_data)
  print("Accuracy on the test set is: {0}".format(a))

iteration 0, cost is [230.12592]
iteration 200, cost is [226.64302]
iteration 400, cost is [228.18521]
iteration 600, cost is [228.69884]
iteration 800, cost is [228.17453]
iteration 1000, cost is [176.64536]
iteration 1200, cost is [144.08676]
iteration 1400, cost is [144.2821]
iteration 1600, cost is [102.24237]
iteration 1800, cost is [71.152054]
iteration 2000, cost is [61.99971]
iteration 2200, cost is [65.786514]
iteration 2400, cost is [46.310677]
iteration 2600, cost is [65.36499]
iteration 2800, cost is [21.10183]
iteration 3000, cost is [17.7791]
iteration 3200, cost is [23.593304]
iteration 3400, cost is [22.3067]
iteration 3600, cost is [10.572483]
iteration 3800, cost is [31.099455]
iteration 4000, cost is [11.542251]
iteration 4200, cost is [13.253819]
iteration 4400, cost is [7.9329395]
iteration 4600, cost is [7.70706]
iteration 4800, cost is [8.353289]
Accuracy on the test set is: [0.9555]


---
# 4. Special Care for Deep Networks
As layers are added, neural networks tend to converge with more difficulty. For example, the accuracy could get stuck at 0.1. Here, we want to apply some updates to the network we built in the previous part to improve its performance.

### ReLU Activation Function
<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/10-comic3.png" style="width: 500px;"/>
The sigmoid activation function is actually quite problematic in deep networks. It squashes all values between 0 and 1 and when you do so repeatedly, neuron outputs and their gradients can vanish entirely. An alternative activation function is **ReLU**, which shows better performance compared to sigmoid. It looks like this:
<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/11-relu.png" style="width: 300px;"/>
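
A rough way to see why gradients vanish with sigmoid is to look only at the slope of the activation function, ignoring the weight matrices. The sketch below is a simplification under that assumption; the function names are made up for illustration. The sigmoid slope peaks at 0.25, so passing a gradient back through the four hidden layers of the network above multiplies it by at most $0.25^4$, while ReLU has slope 1 for any positive input.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

n_hidden = 4  # hidden layers in the five-layer network

# Best case for sigmoid: every weighted sum is 0, where the slope is largest.
sigmoid_factor = sigmoid_grad(0.0) ** n_hidden
# For ReLU, a positive weighted sum passes the gradient through unchanged.
relu_factor = relu_grad(1.0) ** n_hidden
print(sigmoid_factor, relu_factor)
```

The ReLU slope is 0 for negative inputs, which is one reason the bias initialisation discussed below matters.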

### A Better Optimizer
In very high dimensional spaces like here, **saddle points** are frequent. These are points that are not local minima, but where the gradient is nevertheless zero, so the gradient descent optimizer can stay stuck there. One possible solution to tackle this problem is to use a better optimizer, such as the Adam optimizer `tf.train.AdamOptimizer`.


<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/saddle-point-1.png" style="width:300px;">
<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/saddle-point-2.png" style="width:300px;">
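
For intuition, here is a minimal sketch of the textbook Adam update rule on a single scalar parameter. It is not the exact TensorFlow implementation; the function name `adam_minimise`, the toy objective $f(x) = x^2$ and the learning rate of 0.1 are assumptions for the example, while `beta1`, `beta2` and `eps` use the commonly quoted defaults.

```python
import math

def adam_minimise(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Run Adam on a single scalar parameter and return the visited values."""
    x, m, v = x0, 0.0, 0.0
    path = [x]
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
        path.append(x)
    return path

# Toy objective f(x) = x**2 with gradient 2x, starting at x = 1.
path = adam_minimise(lambda x: 2.0 * x, x0=1.0)
print(path[1], path[-1])
```

Because the step is divided by the square root of the running squared gradient, the first step has a size close to `lr` whatever the magnitude of the gradient. This normalisation is one reason Adam can keep moving where plain gradient descent takes tiny steps.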

### Random Initialisations
When working with ReLUs, the best practice is to initialise bias values to small positive values, so that neurons operate in the non-zero range of the ReLU initially.

### Learning Rate
<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/12-comic4.png" style="width: 500px;"/>
With two, three or four intermediate layers, you can now get close to 98% accuracy, if you push the iterations to 5000 or beyond. But, the results are not very consistent, and the curves jump up and down by a whole percent. A good solution is to start fast and decay the learning rate exponentially from $0.005$ to $0.0001$ for example. In order to pass a different learning rate to the `AdamOptimizer` at each iteration, you will need to define a new placeholder and feed it a new value at each iteration through `feed_dict`. Here is the formula for exponential decay: $learning\_rate = lr\_min + (lr\_max - lr\_min) * e^{\frac{-i}{2000}}$, where $i$ is the iteration number.
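
The decay formula can be checked quickly in plain Python before wiring it into the graph. The function name `decayed_learning_rate` is an assumption for this sketch; the constants are the ones given in the text.

```python
import math

def decayed_learning_rate(i, lr_max=0.005, lr_min=0.0001, decay_speed=2000.0):
    """lr_min + (lr_max - lr_min) * exp(-i / decay_speed)."""
    return lr_min + (lr_max - lr_min) * math.exp(-i / decay_speed)

for i in (0, 2000, 5000):
    print(i, decayed_learning_rate(i))
```

The rate starts at `lr_max`, has lost about 63% of the gap to `lr_min` after 2000 iterations, and approaches `lr_min` without ever dropping below it.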

### NaN?
In the network you built in the last section, you might see the accuracy curve crash and the console output NaN for the cross-entropy. This can happen because you are attempting to compute a $log(0)$, which evaluates to $-\infty$ and becomes Not A Number (NaN) once it is multiplied by a zero label. Remember that the cross-entropy involves a log, computed on the output of the softmax layer. Since softmax is essentially an exponential, which is never zero, we should be fine, but with 32 bit precision floating-point operations, exp(-100) is already at the limit of what can be represented, and slightly more negative inputs give a genuine zero. TensorFlow has a handy function that computes the softmax and the cross-entropy in a single step, implemented in a numerically stable way. To use it, you will need to separate the weighted sum plus bias on the last layer, before softmax is applied, and then pass it, together with the true labels, to the function `tf.nn.softmax_cross_entropy_with_logits`.
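
One common way to make this computation stable is the log-sum-exp trick: work directly on the weighted sums (the logits) and never form a probability that could underflow to zero. The sketch below illustrates that idea in plain Python. It is not necessarily how TensorFlow implements it, and the function name `cross_entropy_from_logits` and the extreme logits are assumptions for the example.

```python
import math

def cross_entropy_from_logits(logits, label):
    """Cross-entropy for one example, computed as logsumexp(logits) - logits[label]."""
    top = max(logits)
    # After shifting by the maximum, the largest term is exp(0) = 1, so the log is safe.
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[label]

# The softmax probability of class 2 underflows to zero here,
# so taking its log directly would fail.
extreme_logits = [1000.0, 0.0, -1000.0]
loss = cross_entropy_from_logits(extreme_logits, label=2)
print(loss)
```

The result is a large but finite loss instead of NaN, so training can continue.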

In the code below, apply the following changes and show their impact on the accuracy of the model on training data, as well as the test data:
* Replace the sigmoid activation function with ReLU
* Use the Adam optimizer
* Initialize weights with small random values between -0.2 and +0.2, and make sure biases are initialised with small positive values, for example 0.1
* Update the learning rate in different iterations. Start fast and decay the learning rate exponentially from $0.005$ to $0.0001$, i.e., 
```
max_learning_rate = 0.005
min_learning_rate = 0.0001
decay_speed = 2000.0
```
* Use `tf.nn.softmax_cross_entropy_with_logits` to prevent getting NaN in output.

In [16]:
# TODO: Replace <FILL IN> with appropriate code

# neural network with 5 layers
#
# · · · · · · · · · ·          (input data, flattened pixels)       X [batch, 784]   
# \x/x\x/x\x/x\x/x\x/       -- fully connected layer (relu)         W1 [784, 200]      B1[200]
#  · · · · · · · · ·                                                Y1_hat [batch, 200]
#   \x/x\x/x\x/x\x/         -- fully connected layer (relu)         W2 [200, 100]      B2[100]
#    · · · · · · ·                                                  Y2_hat [batch, 100]
#     \x/x\x/x\x/           -- fully connected layer (relu)         W3 [100, 60]       B3[60]
#      · · · · ·                                                    Y3_hat [batch, 60]
#       \x/x\x/             -- fully connected layer (relu)         W4 [60, 30]        B4[30]
#        · · ·                                                      Y4_hat [batch, 30]
#         \x/               -- fully connected layer (softmax)      W5 [30, 10]        B5[10]
#          ·                                                        Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X = tf.placeholder(tf.float32, [None, 28, 28, 1]) # None: number of images in the mini-batch, 28*28 pixels, 1 value per pixel
Y = tf.placeholder(tf.float32, [None, 10])


# variable learning rate
lr_max = 0.005
lr_min = 0.0001
decay_speed = tf.constant(2000.0)
step = tf.placeholder(tf.float32)
# ASK: what if I want to use tf.int32 for the step? then, how should I divide for (-step/decay_speed) ?
learning_rate = lr_min + (lr_max - lr_min) * tf.math.exp(tf.math.divide(-step, decay_speed))

# five layers and their number of neurons, i.e., 200, 100, 60, 30, and 10
# when using RELUs, make sure biases are initialised with small positive values, for example 0.1
W1 = tf.Variable(tf.truncated_normal([784, 200], stddev=0.1))
B1 = tf.Variable(tf.ones([200])/10)

W2 = tf.Variable(tf.truncated_normal([200, 100], stddev=0.1))
B2 = tf.Variable(tf.ones([100])/10)

W3 = tf.Variable(tf.truncated_normal([100, 60], stddev=0.1))
B3 = tf.Variable(tf.ones([60])/10)

W4 = tf.Variable(tf.truncated_normal([60, 30], stddev=0.1))
B4 = tf.Variable(tf.ones([30])/10)

W5 = tf.Variable(tf.truncated_normal([30, 10], stddev=0.1))
B5 = tf.Variable(tf.ones([10])/10)


########################################
# build the model
########################################
XX = tf.reshape(X, [-1, 784])

Y1_hat = tf.nn.relu(tf.matmul(XX, W1) + B1)
Y2_hat = tf.nn.relu(tf.matmul(Y1_hat, W2) + B2)
Y3_hat = tf.nn.relu(tf.matmul(Y2_hat, W3) + B3)
Y4_hat = tf.nn.relu(tf.matmul(Y3_hat, W4) + B4)

Y_logits = tf.matmul(Y4_hat, W5) + B5
Y_hat = tf.nn.softmax(Y_logits)

########################################
# defining the cost function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=Y_logits, labels=Y)
cross_entropy = tf.reduce_mean(cross_entropy) * 100

# for accuracy
is_correct = tf.equal(tf.argmax(Y_hat,1), tf.argmax(Y,1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))

########################################
# define the optimizer
########################################
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = optimizer.minimize(cross_entropy)


########################################
# execute the model
########################################
init = tf.global_variables_initializer()

n_epochs = 5000
with tf.Session() as sess:
  
  init.run()
  for i in range(n_epochs):
    batch_X, batch_Y = mnist.train.next_batch(100)
    
    sess.run([train_step], feed_dict={XX: batch_X, Y: batch_Y, step: i})
    if i%200 == 0:
      cost = sess.run([cross_entropy], feed_dict={XX: batch_X, Y: batch_Y})
      print("iteration {0}, cost is {1}".format(i, cost))
  
  test_data={XX: mnist.test.images, Y: mnist.test.labels}

  a = sess.run([accuracy], feed_dict=test_data)
  print("Accuracy on the test set is: {0}".format(a))

Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.

iteration 0, cost is [222.66193]
iteration 200, cost is [26.622381]
iteration 400, cost is [16.79353]
iteration 600, cost is [2.7033508]
iteration 800, cost is [4.8881645]
iteration 1000, cost is [9.665116]
iteration 1200, cost is [4.4661536]
iteration 1400, cost is [1.2134466]
iteration 1600, cost is [1.1841532]
iteration 1800, cost is [2.0809076]
iteration 2000, cost is [1.373926]
iteration 2200, cost is [4.185578]
iteration 2400, cost is [0.20083857]
iteration 2600, cost is [0.42323712]
iteration 2800, cost is [0.39059728]
iteration 3000, cost is [0.7292261]
iteration 3200, cost is [0.09952889]
iteration 3400, cost is [0.515056]
iteration 3600, cost is [0.041083]
iteration 3800, cost is [0.095856175]
iteration 4000, cost is [0.5986222]
iteration 4200, cost is [0.24826486]
iteration 4400, cost is

---
# 5. Overfitting and Dropout
<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/13-comic5.png" style="width: 500px;"/>
You will have noticed that cross-entropy curves for test and training data start disconnecting after a couple thousand iterations. The learning algorithm works on training data only and optimises the training cross-entropy accordingly. It never sees test data so it is not surprising that after a while its work no longer has an effect on the test cross-entropy which stops dropping and sometimes even bounces back up. 
<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/14-overfit.png" style="width: 500px;"/>
This disconnect is usually labeled **overfitting** and when you see it, you can try to apply a regularisation technique called **dropout**. In dropout, at each training iteration, you drop random neurons from the network. You choose a probability `pkeep` for a neuron to be kept, usually between 50% and 75%, and then at each iteration of the training loop, you randomly remove neurons with all their weights and biases. Different neurons will be dropped at each iteration. When testing the performance of your network of course you put all the neurons back (`pkeep = 1`).
<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/15-dropout.png" style="width: 500px;"/>
TensorFlow offers a dropout function to be used on the outputs of a layer of neurons. It randomly zeroes-out some of the outputs and boosts the remaining ones by `1 / pkeep`. You can add dropout after each intermediate layer in the network now. 

In the following code, use dropout between each layer during training, and set the probability `pkeep` once to $50\%$ and another time to $75\%$ and compare their results.
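
The behaviour described above (zero out some outputs, boost the rest by `1 / pkeep`) can be sketched in plain Python as follows. This is only an illustration of the idea, not TensorFlow's implementation; the function name `dropout`, the seed and the vector of ones are assumptions for the example.

```python
import random

def dropout(values, pkeep, rng):
    """Keep each value with probability pkeep and scale survivors by 1 / pkeep."""
    return [v / pkeep if rng.random() < pkeep else 0.0 for v in values]

rng = random.Random(42)              # fixed seed so the run is repeatable
activations = [1.0] * 10000

dropped = dropout(activations, pkeep=0.75, rng=rng)
kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(kept_fraction, mean_after)

# With pkeep = 1.0 nothing is dropped, which is the setting meant for testing.
unchanged = dropout(activations, pkeep=1.0, rng=rng)
```

Scaling the surviving outputs keeps their average roughly where it was, so the next layer sees inputs of similar magnitude during training and testing.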

In [17]:
# DROPOUT pkeep 75%
# TODO: Replace <FILL IN> with appropriate code

# neural network with 5 layers
#
# · · · · · · · · · ·          (input data, flattened pixels)       X [batch, 784]   
# \x/x\x/x\x/x\x/x\x/       -- fully connected layer (relu)         W1 [784, 200]      B1[200]
#  · · · · · · · · ·                                                Y1_hat [batch, 200]
#   \x/x\x/x\x/x\x/         -- fully connected layer (relu)         W2 [200, 100]      B2[100]
#    · · · · · · ·                                                  Y2_hat [batch, 100]
#     \x/x\x/x\x/           -- fully connected layer (relu)         W3 [100, 60]       B3[60]
#      · · · · ·                                                    Y3_hat [batch, 60]
#       \x/x\x/             -- fully connected layer (relu)         W4 [60, 30]        B4[30]
#        · · ·                                                      Y4_hat [batch, 30]
#         \x/               -- fully connected layer (softmax)      W5 [30, 10]        B5[10]
#          ·                                                        Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X = tf.placeholder(tf.float32, [None, 28, 28, 1]) # None: number of images in the mini-batch, 28*28 pixels, 1 value per pixel
Y = tf.placeholder(tf.float32, [None, 10])

# variable learning rate
lr_max = 0.005
lr_min = 0.0001
decay_speed = tf.constant(2000.0)
step = tf.placeholder(tf.float32)
learning_rate = lr_min + (lr_max - lr_min) * tf.math.exp(tf.math.divide(-step, decay_speed))


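# probability of keeping a node during dropout: ideally 1.0 at test time (no dropout) and 0.75 at training time
# NOTE: pkeep is a plain Python constant here, so dropout also stays active when the test accuracy is computed;
# feeding it through a placeholder would allow pkeep = 1.0 at test time
pkeep = 0.75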

# five layers and their number of neurons, i.e., 200, 100, 60, 30, and 10
# when using RELUs, make sure biases are initialised with small positive values, for example 0.1
W1 = tf.Variable(tf.truncated_normal([784, 200], stddev=0.1))
B1 = tf.Variable(tf.ones([200])/10)

W2 = tf.Variable(tf.truncated_normal([200, 100], stddev=0.1))
B2 = tf.Variable(tf.ones([100])/10)

W3 = tf.Variable(tf.truncated_normal([100, 60], stddev=0.1))
B3 = tf.Variable(tf.ones([60])/10)

W4 = tf.Variable(tf.truncated_normal([60, 30], stddev=0.1))
B4 = tf.Variable(tf.ones([30])/10)

W5 = tf.Variable(tf.truncated_normal([30, 10], stddev=0.1))
B5 = tf.Variable(tf.ones([10])/10)

########################################
# build the model
########################################

XX = tf.reshape(X, [-1, 784])

Y1_hat = tf.nn.relu(tf.matmul(XX, W1) + B1)
Y1_hat_dropout = tf.nn.dropout(Y1_hat, pkeep)

Y2_hat = tf.nn.relu(tf.matmul(Y1_hat_dropout, W2) + B2)
Y2_hat_dropout = tf.nn.dropout(Y2_hat, pkeep)

Y3_hat = tf.nn.relu(tf.matmul(Y2_hat_dropout, W3) + B3)
Y3_hat_dropout = tf.nn.dropout(Y3_hat, pkeep)

Y4_hat = tf.nn.relu(tf.matmul(Y3_hat_dropout, W4) + B4)
Y4_hat_dropout = tf.nn.dropout(Y4_hat, pkeep)

Y_logits = tf.matmul(Y4_hat_dropout, W5) + B5
Y_hat = tf.nn.softmax(Y_logits)


########################################
# define the cost function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=Y_logits, labels=Y)
cross_entropy = tf.reduce_mean(cross_entropy) * 100

# for accuracy
is_correct = tf.equal(tf.argmax(Y_hat,1), tf.argmax(Y,1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))


########################################
# define the optimizer
########################################
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = optimizer.minimize(cross_entropy)


########################################
# execute the model
########################################
init = tf.global_variables_initializer()

n_epochs = 5000
with tf.Session() as sess:
  
  init.run()
  for i in range(n_epochs):
    batch_X, batch_Y = mnist.train.next_batch(100)
    
    sess.run([train_step], feed_dict={XX: batch_X, Y: batch_Y, step: i})
    if i%200 == 0:
      cost = sess.run([cross_entropy], feed_dict={XX: batch_X, Y: batch_Y})
      print("iteration {0}, cost is {1}".format(i, cost))
  
  test_data={XX: mnist.test.images, Y: mnist.test.labels}

  a = sess.run([accuracy], feed_dict=test_data)
  print("Accuracy on the test set is: {0}".format(a))

iteration 0, cost is [220.79858]
iteration 200, cost is [48.671577]
iteration 400, cost is [17.606796]
iteration 600, cost is [44.63339]
iteration 800, cost is [12.162537]
iteration 1000, cost is [12.537945]
iteration 1200, cost is [14.779859]
iteration 1400, cost is [20.719618]
iteration 1600, cost is [8.574167]
iteration 1800, cost is [8.994978]
iteration 2000, cost is [7.3451586]
iteration 2200, cost is [7.400474]
iteration 2400, cost is [10.00017]
iteration 2600, cost is [9.237671]
iteration 2800, cost is [14.582441]
iteration 3000, cost is [6.6937165]
iteration 3200, cost is [5.4167805]
iteration 3400, cost is [6.294631]
iteration 3600, cost is [8.543221]
iteration 3800, cost is [3.9466403]
iteration 4000, cost is [22.447117]
iteration 4200, cost is [4.9647303]
iteration 4400, cost is [7.2080264]
iteration 4600, cost is [2.6475177]
iteration 4800, cost is [5.781323]
Accuracy on the test set is: [0.9691]


In [18]:
# DROPOUT pkeep 50%
# TODO: Replace <FILL IN> with appropriate code

# neural network with 5 layers
#
# · · · · · · · · · ·          (input data, flattened pixels)       X [batch, 784]   
# \x/x\x/x\x/x\x/x\x/       -- fully connected layer (relu+dropout) W1 [784, 200]      B1[200]
#  · · · · · · · · ·                                                Y1_hat [batch, 200]
#   \x/x\x/x\x/x\x/         -- fully connected layer (relu+dropout) W2 [200, 100]      B2[100]
#    · · · · · · ·                                                  Y2_hat [batch, 100]
#     \x/x\x/x\x/           -- fully connected layer (relu+dropout) W3 [100, 60]       B3[60]
#      · · · · ·                                                    Y3_hat [batch, 60]
#       \x/x\x/             -- fully connected layer (relu+dropout) W4 [60, 30]        B4[30]
#        · · ·                                                      Y4_hat [batch, 30]
#         \x/               -- fully connected layer (softmax)      W5 [30, 10]        B5[10]
#          ·                                                        Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X = tf.placeholder(tf.float32, [None, 28, 28, 1]) # None: number of images in the mini-batch, 28*28 pixels, 1 value per pixel
Y = tf.placeholder(tf.float32, [None, 10])

# variable learning rate
lr_max = 0.005
lr_min = 0.0001
decay_speed = tf.constant(2000.0)
step = tf.placeholder(tf.float32)
learning_rate = lr_min + (lr_max - lr_min) * tf.math.exp(tf.math.divide(-step, decay_speed))


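# probability of keeping a node during dropout; ideally 1.0 at test time (no dropout) and 0.50 at training time
# note: pkeep is a plain Python constant here, so dropout also stays active when the test accuracy is computed
pkeep = 0.50 # or you can use a tf.placeholder() and feed the value in session execution time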


# five layers and their number of neurons, i.e., 200, 100, 60, 30, and 10
# when using RELUs, make sure biases are initialised with small positive values, for example 0.1
W1 = tf.Variable(tf.truncated_normal([784, 200], stddev=0.1))
B1 = tf.Variable(tf.ones([200])/10)

W2 = tf.Variable(tf.truncated_normal([200, 100], stddev=0.1))
B2 = tf.Variable(tf.ones([100])/10)

W3 = tf.Variable(tf.truncated_normal([100, 60], stddev=0.1))
B3 = tf.Variable(tf.ones([60])/10)

W4 = tf.Variable(tf.truncated_normal([60, 30], stddev=0.1))
B4 = tf.Variable(tf.ones([30])/10)

W5 = tf.Variable(tf.truncated_normal([30, 10], stddev=0.1))
B5 = tf.Variable(tf.ones([10])/10)

########################################
# build the model
########################################

# flatten each 28x28 image into a vector of 784 values
XX = tf.reshape(X, [-1, 784])

Y1_hat = tf.nn.relu(tf.matmul(XX, W1) + B1)
Y1_hat_dropout = tf.nn.dropout(Y1_hat, pkeep)

Y2_hat = tf.nn.relu(tf.matmul(Y1_hat_dropout, W2) + B2)
Y2_hat_dropout = tf.nn.dropout(Y2_hat, pkeep)

Y3_hat = tf.nn.relu(tf.matmul(Y2_hat_dropout, W3) + B3)
Y3_hat_dropout = tf.nn.dropout(Y3_hat, pkeep)

Y4_hat = tf.nn.relu(tf.matmul(Y3_hat_dropout, W4) + B4)
Y4_hat_dropout = tf.nn.dropout(Y4_hat, pkeep)

Y_logits = tf.matmul(Y4_hat_dropout, W5) + B5
Y_hat = tf.nn.softmax(Y_logits)


########################################
# define the cost function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=Y_logits, labels=Y)
cross_entropy = tf.reduce_mean(cross_entropy) * 100

# for accuracy
is_correct = tf.equal(tf.argmax(Y_hat,1), tf.argmax(Y,1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))


########################################
# define the optimizer
########################################
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = optimizer.minimize(cross_entropy)


########################################
# execute the model
########################################
init = tf.global_variables_initializer()

n_epochs = 5000
with tf.Session() as sess:
  
  init.run()
  for i in range(n_epochs):
    batch_X, batch_Y = mnist.train.next_batch(100)
    
    sess.run([train_step], feed_dict={XX: batch_X, Y: batch_Y, step: i})
    if i%200 == 0:
      cost = sess.run([cross_entropy], feed_dict={XX: batch_X, Y: batch_Y})
      print("iteration {0}, cost is {1}".format(i, cost))
  
  test_data={XX: mnist.test.images, Y: mnist.test.labels}

  a = sess.run([accuracy], feed_dict=test_data)
  print("Accuracy on the test set is: {0}".format(a))

iteration 0, cost is [229.93216]
iteration 200, cost is [111.02097]
iteration 400, cost is [85.23444]
iteration 600, cost is [64.1103]
iteration 800, cost is [75.219185]
iteration 1000, cost is [80.99427]
iteration 1200, cost is [35.30992]
iteration 1400, cost is [51.995163]
iteration 1600, cost is [38.948914]
iteration 1800, cost is [45.783844]
iteration 2000, cost is [27.992756]
iteration 2200, cost is [30.026806]
iteration 2400, cost is [50.836666]
iteration 2600, cost is [22.931337]
iteration 2800, cost is [25.898378]
iteration 3000, cost is [31.93489]
iteration 3200, cost is [34.734787]
iteration 3400, cost is [26.913143]
iteration 3600, cost is [58.85297]
iteration 3800, cost is [60.725792]
iteration 4000, cost is [20.00707]
iteration 4200, cost is [28.569174]
iteration 4400, cost is [13.7928705]
iteration 4600, cost is [29.772747]
iteration 4800, cost is [24.043648]
Accuracy on the test set is: [0.9166]


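**Our Comparison:** Less dropout (a higher keep probability) works better here: the test accuracy is 0.9691 with `pkeep = 0.75` and 0.9166 with `pkeep = 0.50`. With less dropout, more of each layer's capacity is available during training. Note that `pkeep` is a constant in both runs, so dropout is also active when the test accuracy is computed, which probably penalises the `pkeep = 0.50` run more.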
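
To make the role of `pkeep` concrete, here is a minimal pure-Python sketch of what dropout does to one layer's activations. It assumes the TensorFlow 1.x behaviour of `tf.nn.dropout`, which keeps each value with probability `keep_prob` and scales the kept values by `1 / keep_prob`. The helper name `dropout` and the sample activations are made up for illustration.

```python
import random

def dropout(activations, pkeep, rng):
    """Keep each activation with probability pkeep and rescale the survivors.

    Dividing by pkeep keeps the expected value of every activation unchanged,
    so nothing needs to be rescaled when dropout is switched off (pkeep = 1.0).
    """
    return [a / pkeep if rng.random() < pkeep else 0.0 for a in activations]

rng = random.Random(42)
layer_output = [1.0] * 10000  # hypothetical activations, all equal to 1.0

for pkeep in (1.0, 0.75, 0.50):
    dropped = dropout(layer_output, pkeep, rng)
    kept_fraction = sum(1 for d in dropped if d != 0.0) / len(dropped)
    mean = sum(dropped) / len(dropped)
    print("pkeep={:.2f}: kept {:.1%} of the nodes, mean activation {:.3f}".format(
        pkeep, kept_fraction, mean))
```

In the notebook, the same switch can be obtained by feeding `pkeep` through a `tf.placeholder`, with 0.75 or 0.50 during training and 1.0 when computing the test accuracy.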

---
# 6. Convolutional Network
<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/16-comic6.png" style="width: 500px;"/>
In the previous sections, we flattened all the pixels of each image into a single vector, which was a really bad idea. Handwritten digits are made of shapes, and we discarded the shape information when we flattened the pixels. However, we can use **convolutional neural networks (CNNs)** to take advantage of shape information. CNNs apply *a series of filters* to the raw pixel data of an image to extract and learn higher-level features, which the model can then use for classification. A CNN contains three types of components:
  - **Convolutional layers**: apply a specified number of convolution filters to the image. For each subregion, the layer performs a set of mathematical operations to produce a single value in the output feature map. Convolutional layers then typically apply a ReLU activation function to the output to introduce nonlinearities into the model.
  - **Pooling layers**: downsample the image data extracted by the convolutional layers to reduce the dimensionality of the feature map in order to decrease processing time. A commonly used pooling algorithm is max pooling, which extracts subregions of the feature map (e.g., 2x2-pixel tiles), keeps their maximum value, and discards all other values.
  - **Dense (fully connected) layers**: perform classification on the features extracted by the convolutional layers and downsampled by the pooling layers. In a dense layer, every node in the layer is connected to every node in the preceding layer.
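
The max pooling step described above can be sketched in a few lines of plain Python. This is only an illustration on a made-up 4x4 feature map with non-overlapping 2x2 tiles; the helper name `max_pool_2x2` is hypothetical and is not a TensorFlow function.

```python
def max_pool_2x2(feature_map):
    """Downsample a 2-D feature map by taking the maximum of each 2x2 tile."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool_2x2(feature_map))  # [[4, 2], [2, 8]]
```

Each output value summarises four input values, so the feature map shrinks from 4x4 to 2x2.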
  
Typically, a CNN is composed of a *stack of **convolutional modules*** that perform feature extraction. Each *module* consists of a *convolutional layer* followed by a *pooling layer*. The last convolutional module is followed by one or more dense layers that perform classification. The final dense layer in a CNN contains a single neuron for each target class in the model, with a softmax activation function to generate a value between 0 and 1 for each neuron. We can interpret the softmax values for a given image as relative measurements of how likely it is that the image falls into each target class.
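
As a small illustration of the softmax step, the following sketch converts the raw outputs (logits) of a final dense layer into values between 0 and 1 that sum to 1. The three logits are invented for the example; in the lab, the final layer has 10 of them, one per digit.

```python
import math

def softmax(logits):
    """Turn raw scores into positive values that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical outputs of a 3-class final layer
probs = softmax(logits)
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The largest logit gets the largest share, and the predicted class is simply the index of the maximum value.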

Now, let us build a convolutional network for handwritten digit recognition. In this assignment, we will use the architecture shown in the following figure, which has three convolutional layers, one fully connected layer, and one softmax layer. Notice that the second and third convolutional layers have a stride of two, which explains why they bring the size of the output down from 28x28 to 14x14 and then to 7x7. A convolutional layer requires a weights tensor like `[4, 4, 3, 2]`, in which the first two numbers define the size of a filter (map), the third number is the *depth* of the filter, i.e., the number of *input channels*, and the last number is the number of *output channels*. The number of output channels defines how many times we repeat the same operation with a different set of weights in one layer. In our implementation, the output depths of the three convolutional layers are 4, 8, and 12, and the size of the fully connected layer is 200.
<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/17-arch1.png" style="width: 600px;"/>
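
The 28x28, 14x14, and 7x7 sizes can be checked with a little arithmetic. The sketch below assumes `padding='SAME'`, for which TensorFlow produces an output of size `ceil(input_size / stride)`; the helper name `same_padding_output_size` is hypothetical.

```python
import math

def same_padding_output_size(input_size, stride):
    """Spatial output size of a convolution with padding='SAME' in TensorFlow."""
    return math.ceil(input_size / stride)

size = 28
for name, stride in [("conv1", 1), ("conv2", 2), ("conv3", 2)]:
    size = same_padding_output_size(size, stride)
    print("{}: stride {} -> {}x{}".format(name, stride, size, size))

# the third convolution has 12 output channels, so the dense layer sees:
flattened = size * size * 12
print("inputs to the fully connected layer:", flattened)  # 7*7*12 = 588
```

This is why the output of the third convolution is reshaped to `[batch, 7*7*12]` before the fully connected layer.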

Convolutional layers can be implemented in TensorFlow using the `tf.nn.conv2d` function, which scans the input image in both directions using the supplied weights. This is only the weighted-sum part of the neuron; you still need to add a bias and feed the result through an activation function. The padding strategy that works here is `padding='SAME'`, which pads the borders of the image with zeros so that a stride of 1 preserves the 28x28 size. All digits are on a uniform background, so this just extends the background and should not add any unwanted shapes.
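
To show what one convolutional neuron computes, here is a minimal single-channel sketch in plain Python: a weighted sum over each window, plus a bias, followed by ReLU. It assumes zero padding laid out the way TensorFlow's `'SAME'` mode does (any odd leftover padding goes at the bottom and right). The function `conv2d_same` and its toy inputs are illustrative only and are far slower than `tf.nn.conv2d`.

```python
import math

def conv2d_same(image, kernel, stride, bias=0.0):
    """Single-channel 2-D convolution with zero 'SAME' padding, bias and ReLU."""
    size, k = len(image), len(kernel)  # square image and square filter
    out_size = math.ceil(size / stride)
    pad_total = max((out_size - 1) * stride + k - size, 0)
    pad_before = pad_total // 2  # any odd leftover padding goes after

    def pixel(r, c):
        # positions outside the image count as zero (the padding)
        if 0 <= r < size and 0 <= c < size:
            return image[r][c]
        return 0.0

    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            top, left = i * stride - pad_before, j * stride - pad_before
            weighted_sum = sum(
                kernel[a][b] * pixel(top + a, left + b)
                for a in range(k) for b in range(k)
            )
            row.append(max(weighted_sum + bias, 0.0))  # add bias, then ReLU
        output.append(row)
    return output

image = [[1.0] * 4 for _ in range(4)]   # hypothetical 4x4 input of ones
kernel = [[1.0] * 3 for _ in range(3)]  # hypothetical 3x3 filter of ones
for row in conv2d_same(image, kernel, stride=1):
    print(row)
```

With a stride of 1 the output stays 4x4, and the smaller values at the borders show where the window overlaps the zero padding. With `stride=2` the same input yields a 2x2 output.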

In [0]:
# TODO: Replace <FILL IN> with appropriate code

# · · · · · · · · · ·      (input data, 1-deep)               X [batch, 28, 28, 1]
# @ @ @ @ @ @ @ @ @ @   -- conv. layer 5x5x1=>4 stride 1      W1 [5, 5, 1, 4]        B1 [4]
# ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                         Y1_hat [batch, 28, 28, 4]
#   @ @ @ @ @ @ @ @     -- conv. layer 5x5x4=>8 stride 2      W2 [5, 5, 4, 8]        B2 [8]
#   ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                           Y2_hat [batch, 14, 14, 8]
#     @ @ @ @ @ @       -- conv. layer 4x4x8=>12 stride 2     W3 [4, 4, 8, 12]       B3 [12]
#     ∶∶∶∶∶∶∶∶∶∶∶                                             Y3_hat [batch, 7, 7, 12] => reshaped to YY [batch, 7*7*12]
#      \x/x\x\x/        -- fully connected layer (relu)       W4 [7*7*12, 200]       B4 [200]
#       · · · ·                                               Y4_hat [batch, 200]
#       \x/x\x/         -- fully connected layer (softmax)    W5 [200, 10]           B5 [10]
#        · · ·                                                Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X = <FILL IN>
Y = <FILL IN>
learning_rate = <FILL IN>

# three convolutional layers with their channel counts, and a fully connected layer 
# (the last layer has 10 softmax neurons)
# the output depths of the three convolutional layers are 4, 8, 12, and the size of the fully connected
# layer is 200
W1 = <FILL IN>
B1 = <FILL IN>

W2 = <FILL IN>
B2 = <FILL IN>

W3 = <FILL IN>
B3 = <FILL IN>

W4 = <FILL IN>
B4 = <FILL IN>

W5 = <FILL IN>
B5 = <FILL IN>

########################################
# build the model
########################################
stride = 1  # output is 28x28
Y1_hat = <FILL IN> # use tf.nn.conv2d

stride = 2  # output is 14x14
Y2_hat = <FILL IN>

stride = 2  # output is 7x7
Y3_hat = <FILL IN>

# reshape the output from the third convolution for the fully connected layer
YY_hat = tf.reshape(<FILL IN>)
Y4_hat = <FILL IN>
Y_hat = <FILL IN>

########################################
# define the cost function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(<FILL IN>)
cross_entropy = tf.reduce_mean(cross_entropy) * 100

########################################
# define the optimizer
########################################
optimizer = <FILL IN>
train_step = optimizer.minimize(<FILL IN>)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()

n_epochs = 5000
with tf.Session() as sess:
    <FILL IN>

# 7. Improve The Performance
A good approach to sizing your neural networks is to implement a network that is a little too constrained, then give it a few more degrees of freedom and add dropout to make sure it is not overfitting. This ends up with a fairly optimal network for your problem. In the above model, we set the number of output channels to 4 in the first convolutional layer, which means that we repeat the same filter shape (but with different weights) four times. If we assume that those filters evolve during training into shape recognisers, you can intuitively see that this might not be enough for our problem: handwritten digits are made from more than 4 elemental shapes. So let us bump up the filter sizes a little, increase the number of filters in our convolutional layers from 4, 8, 12 to 6, 12, 24, and add dropout on the fully connected layer. The following figure shows the new architecture you should build. Please complete the following code based on the given architecture and the dropout technique.
<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/18-arch2.png" style="width: 600px;"/>
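
To get a feel for how many degrees of freedom the change adds, the sketch below counts the trainable parameters of both networks from their weight shapes. It assumes the 6x6 first-layer filter named in the layer diagram of this section (`6x6x1=>6`); the helper name `count_parameters` is hypothetical.

```python
def count_parameters(layers):
    """Sum weights and biases over a list of weight-tensor shapes."""
    total = 0
    for shape in layers:
        weights = 1
        for dim in shape:
            weights *= dim
        biases = shape[-1]  # one bias per output channel or neuron
        total += weights + biases
    return total

# weight shapes from section 6: output depths 4, 8, 12
small = [(5, 5, 1, 4), (5, 5, 4, 8), (4, 4, 8, 12), (7 * 7 * 12, 200), (200, 10)]
# weight shapes for this section: output depths 6, 12, 24 (6x6 first filter assumed)
large = [(6, 6, 1, 6), (5, 5, 6, 12), (4, 4, 12, 24), (7 * 7 * 24, 200), (200, 10)]

print("section 6 network:", count_parameters(small))
print("section 7 network:", count_parameters(large))
```

Most of the parameters sit in the first fully connected layer, which is also where this section adds dropout.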

In [0]:
# TODO: Replace <FILL IN> with appropriate code

# · · · · · · · · · ·    (input data, 1-deep)                 X [batch, 28, 28, 1]
# @ @ @ @ @ @ @ @ @ @ -- conv. layer 6x6x1=>6 stride 1        W1 [6, 6, 1, 6]        B1 [6]
# ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                         Y1_hat [batch, 28, 28, 6]
#   @ @ @ @ @ @ @ @   -- conv. layer 5x5x6=>12 stride 2       W2 [5, 5, 6, 12]        B2 [12]
#   ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                           Y2_hat [batch, 14, 14, 12]
#     @ @ @ @ @ @     -- conv. layer 4x4x12=>24 stride 2      W3 [4, 4, 12, 24]       B3 [24]
#     ∶∶∶∶∶∶∶∶∶∶∶                                             Y3_hat [batch, 7, 7, 24] => reshaped to YY [batch, 7*7*24]
#      \x/x\x\x/ ✞    -- fully connected layer (relu+dropout) W4 [7*7*24, 200]       B4 [200]
#       · · · ·                                               Y4_hat [batch, 200]
#       \x/x\x/       -- fully connected layer (softmax)      W5 [200, 10]           B5 [10]
#        · · ·                                                Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X = <FILL IN>
Y = <FILL IN>
lr = <FILL IN>
pkeep = <FILL IN>

# three convolutional layers with their channel counts, and a fully connected layer 
# (the last layer has 10 softmax neurons)
# the output depths of the three convolutional layers are 6, 12, 24, and the size of the fully connected
# layer is 200
W1 = <FILL IN>
B1 = <FILL IN>

W2 = <FILL IN>
B2 = <FILL IN>

W3 = <FILL IN>
B3 = <FILL IN>

W4 = <FILL IN>
B4 = <FILL IN>

W5 = <FILL IN>
B5 = <FILL IN>

########################################
# build the model
########################################
stride = 1  # output is 28x28
Y1_hat = <FILL IN>

stride = 2  # output is 14x14
Y2_hat = <FILL IN>

stride = 2  # output is 7x7
Y3_hat = <FILL IN>

# reshape the output from the third convolution for the fully connected layer
YY_hat = tf.reshape(<FILL IN>)
Y4_hat = <FILL IN>
YY4_hat = tf.nn.dropout(<FILL IN>)
Y_hat = <FILL IN>

########################################
# define the Loss function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(<FILL IN>)
cross_entropy = tf.reduce_mean(cross_entropy) * 100

########################################
# train the model
########################################
optimizer = <FILL IN>
train_step = optimizer.<FILL IN>

########################################
# execute the model
########################################
init = tf.global_variables_initializer()

n_epochs = 2000
with tf.Session() as sess:
    <FILL IN>

---
# 8. Tensorflow Layers Module
The TensorFlow `tf.layers` module provides a high-level API that makes it easy to construct a neural network. It provides methods that facilitate (i) the creation of dense (fully connected) and convolutional layers, (ii) adding activation functions, and (iii) applying dropout regularization. In this section, use the `tf.layers` module to build the network you made in section 7.

In [0]:
# TODO: Replace <FILL IN> with appropriate code

# to reset the Tensorflow default graph
reset_graph()

<FILL IN> # :)

---
# 9. Keras
Keras is a high-level API to build and train deep learning models. It's used for fast prototyping, advanced research, and production. `tf.keras` is TensorFlow's implementation of the Keras API specification. To work with Keras, you need to import `tf.keras` as part of your TensorFlow program setup.
```
import tensorflow as tf
from tensorflow.keras import layers
```
#### Build a model
In Keras, you assemble **layers** to build a model, i.e., a graph of layers. The most common type of model is a stack of layers: the `tf.keras.Sequential` model. For example, the following code builds a simple, fully-connected network (i.e., multi-layer perceptron):
```
model = tf.keras.Sequential()
# adds a densely-connected layer with 64 units to the model:
model.add(layers.Dense(64, activation='relu'))
# add another
model.add(layers.Dense(64, activation='relu'))
# add a softmax layer with 10 output units:
model.add(layers.Dense(10, activation='softmax'))
```
There are many `tf.keras.layers` available with some common constructor parameters:
* `activation`: set the activation function for the layer, which is specified by the name of a built-in function or as a callable object.
* `kernel_initializer` and `bias_initializer`: the initialization schemes that create the layer's weights (kernel and bias).
* `kernel_regularizer` and `bias_regularizer`: the regularization schemes that are applied to the layer's weights (kernel and bias), such as L1 or L2 regularization.

#### Train and evaluate
After you construct a model, you can configure its learning process by calling the `compile` method:
```
model.compile(optimizer=tf.train.AdamOptimizer(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```
The method `tf.keras.Model.compile` takes three important arguments:
* `optimizer`: it specifies the training procedure, e.g., `tf.train.AdamOptimizer` and `tf.train.GradientDescentOptimizer`.
* `loss`: the cost function to minimize during optimization, e.g., mean square error (mse), categorical_crossentropy, and binary_crossentropy.
* `metrics`: used to monitor training, e.g., `accuracy`.

The next step after configuring the model is to train it by calling the `model.fit` method with the training data as its input. After training the model, you can call the `tf.keras.Model.evaluate` method to compute the inference-mode loss and metrics for the data provided, and the `tf.keras.Model.predict` method to get the output of the last layer in inference mode for the data provided.

You can read more about Keras [here](https://www.tensorflow.org/guide/keras).

In this task, please use Keras to rebuild the network you made in section 7.

In [0]:
# TODO: Replace <FILL IN> with appropriate code

# to reset the Tensorflow default graph
reset_graph()

<FILL IN> # :)

---
# 10. Implement LeNet-5
In this section, you should implement **LeNet-5** either using Tensorflow or Keras. Please take a look at its [paper](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) before starting to implement it.
The LeNet-5 architecture is perhaps the most widely known CNN architecture. It was created by Yann LeCun in 1998 and is widely used for handwritten digit recognition (MNIST). It is composed of the layers shown in the following table.

<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/19-letnet5.png" style="width: 600px;"/>

There are a few extra details to be noted:
* MNIST images are 28×28 pixels, but they are zero-padded to 32×32 pixels and normalized before being fed to the network. The rest of the network does not use any padding, which is why the size keeps shrinking as the image progresses through the network.
* The average pooling layers are slightly more complex than usual: each neuron computes the mean of its inputs, then multiplies the result by a learnable coefficient and adds a learnable bias term, then finally applies the activation function.
* Most neurons in layer C3 maps are connected to neurons in only three or four S2 maps (instead of all six S2 maps). See table 1 in the [paper](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) for details.
* The output layer is a bit special: instead of computing the dot product of the inputs and the weight vector, each neuron outputs the square of the Euclidean distance between its input vector and its weight vector. Each output measures how much the image belongs to a particular digit class. The cross-entropy cost function is now preferred, as it penalizes bad predictions much more, producing larger gradients and thus converging faster.
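
The shrinking sizes mentioned in the first bullet can be traced with simple arithmetic. The sketch below assumes the window sizes and strides given in the LeNet-5 paper (5x5 convolutions with stride 1, 2x2 pooling with stride 2, no padding after the input); the helper name `valid_output_size` is hypothetical.

```python
def valid_output_size(input_size, window, stride):
    """Output size of a convolution or pooling layer without padding."""
    return (input_size - window) // stride + 1

# (layer name, window size, stride) as described in the LeNet-5 paper
lenet5_layers = [
    ("C1 convolution", 5, 1),
    ("S2 avg pooling", 2, 2),
    ("C3 convolution", 5, 1),
    ("S4 avg pooling", 2, 2),
    ("C5 convolution", 5, 1),
]

size = 32  # 28x28 MNIST image zero-padded to 32x32
sizes = [size]
for name, window, stride in lenet5_layers:
    size = valid_output_size(size, window, stride)
    sizes.append(size)
    print("{}: {}x{}".format(name, size, size))
```

By the time the data reaches C5, each feature map is a single value, so C5 behaves like a fully connected layer.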

In [0]:
# TODO: Build the LeNet-5 model, and test it on MNIST

# to reset the Tensorflow default graph
reset_graph()

---
# 11. Implement AlexNet
In the last section, you should implement **AlexNet** using either Tensorflow or Keras. Again, please take a look at its [paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) before starting to implement it.
The AlexNet CNN architecture won the [ImageNet ILSVRC challenge](http://www.image-net.org/challenges/LSVRC/2012/) in 2012 by a large margin. It was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It is quite similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer. The following table presents this architecture.
<img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/figs/20-alexnet.png" style="width: 600px;"/>

Training the model requires a big dataset; in this assignment, however, you are going to assign pretrained weights to your model using `tf.Variable.assign`. You can download the pretrained weights from [bvlc_alexnet.npy](https://www.cs.toronto.edu/~guerzhoy/tf_alexnet/bvlc_alexnet.npy). This is a NumPy `.npy` file that stores a pickled Python dictionary. After you read this file, you will get a Python dictionary with a <key, value> pair for each layer. Each key is one of the layer names, e.g., `conv1`, and each value is a list of two values: (1) the weights and (2) the biases of that layer. Part of the function that loads the weights and biases into your model is given, and you need to complete it.

Here is what you see if you read and print the shape of each layer from the file:
```
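# allow_pickle=True is required by newer NumPy versions to load the pickled dictionary
weights_dic = np.load("bvlc_alexnet.npy", encoding="bytes", allow_pickle=True).item()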
for layer in weights_dic:
    print("-" * 20)
    print(layer)
    for wb in weights_dic[layer]:
        print(wb.shape)

#--------------------
# fc8
# (4096, 1000) # weights
# (1000,) # bias
#--------------------
# fc7
# (4096, 4096) # weights
# (4096,) # bias
#--------------------
# fc6
# (9216, 4096) # weights
# (4096,) # bias
#--------------------
# conv5
# (3, 3, 192, 256) # weights
# (256,) # bias
#--------------------
# conv4
# (3, 3, 192, 384) # weights
# (384,) # bias
#--------------------
# conv3
# (3, 3, 256, 384) # weights
# (384,) # bias
#--------------------
# conv2
# (5, 5, 48, 256) # weights
# (256,) # bias
#--------------------
# conv1
# (11, 11, 3, 96) # weights
# (96,) # bias
```
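
One detail worth noticing in this listing is that `conv2`, `conv4`, and `conv5` have fewer input channels than the preceding layer produces. A plausible reading, consistent with the two-GPU split described in the AlexNet paper, is that these layers are grouped convolutions in which each half of the filters sees only half of the input channels. The sketch below infers the number of groups from the shapes alone; the helper name `infer_groups` is hypothetical.

```python
# weight shapes copied from the listing above: (height, width, in_channels, out_channels)
conv_shapes = [
    ("conv1", (11, 11, 3, 96)),
    ("conv2", (5, 5, 48, 256)),
    ("conv3", (3, 3, 256, 384)),
    ("conv4", (3, 3, 192, 384)),
    ("conv5", (3, 3, 192, 256)),
]

def infer_groups(shapes, image_channels=3):
    """Compare each layer's in_channels with the previous layer's out_channels."""
    groups = {}
    previous_out = image_channels
    for name, (_, _, in_channels, out_channels) in shapes:
        groups[name] = previous_out // in_channels
        previous_out = out_channels
    return groups

groups = infer_groups(conv_shapes)
for name, _ in conv_shapes:
    print(name, "groups:", groups[name])
```

When building the model, a layer with two groups can be implemented by splitting its input and its weights in half along the channel axis, convolving each half separately, and concatenating the results.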


In [0]:
# TODO: Replace <FILL IN> with appropriate code

# to reset the Tensorflow default graph
reset_graph()

# build the AlexNet model
<FILL IN> :)

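# load initial weights and biases to the model
def load_initial_weights(self, session):
    # load the weights into memory
    # (allow_pickle=True is required by newer NumPy versions to load the pickled dictionary)
    weights_dict = np.load('bvlc_alexnet.npy', encoding='bytes', allow_pickle=True).item()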

    # loop over all layer names stored in the weights dict
    for layer in weights_dict:
        with tf.variable_scope(layer, reuse=True):
            # loop over list of weights/biases and assign them to their corresponding tf variable
            for wb in weights_dict[layer]:
                # biases
                if len(wb.shape) == 1:
                    bias = tf.get_variable(<FILL IN>)
                    session.run(bias.assign(wb))
                # weights
                else:
                    weight = tf.get_variable(<FILL IN>)
                    session.run(weight.assign(wb))
                

#### Test the model
After building the AlexNet model, you can test it on different images and report the accuracy of the model. To do so, you first need to use the **OpenCV** library to prepare the images as input to the model. OpenCV is a library used for image processing. Below you can see how to read an image file and pre-process it using OpenCV before giving it to the model. However, you need to complete the code and test the accuracy of your model. The test images (shown below) are available in the `test_images` folder.
<table width="100%">
<tr>
<td><img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/test_images/test_image1.jpg" style="width:200px;"></td>
<td align="center"><img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/test_images/test_image2.jpg" style="width:200px;"></td>
<td align="right"><img src="https://raw.githubusercontent.com/ssheikholeslami/id2223-lab2/master/test_images/test_image3.jpg" style="width:200px;"></td>
</tr>
</table>

In [0]:
# TODO: Replace <FILL IN> with appropriate code
# test the AlexNet model on the given images

import os
import cv2

#get list of all images
current_dir = os.getcwd()
image_path = os.path.join(current_dir, 'test_images')
img_files = [os.path.join(image_path, f) for f in os.listdir(image_path) if f.endswith('.jpg')]

#load all images
imgs = []
for f in img_files:
    imgs.append(cv2.imread(f))

with tf.Session() as sess:
    <FILL IN>
    
    # loop over all images
    for i, image in enumerate(imgs):
        # convert image to float32 and resize to (227x227)
        img = cv2.resize(image.astype(np.float32), (227, 227))
        
        # subtract the ImageNet mean
        # Mean subtraction per channel centers the data around zero mean for each channel.
        # This typically helps the network to learn faster since gradients act uniformly for each channel.
        # Note: cv2.imread returns channels in (B, G, R) order, which is the order of the mean values below.
        imagenet_mean = np.array([104., 117., 124.], dtype=np.float32)
        img -= imagenet_mean
        
        # reshape as needed to feed into model
        img = img.reshape((1, 227, 227, 3))
        
        <FILL IN>