# Modern Data Science 
**(Module 05: Deep Learning)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au), Australia

---


# Session D - Deep Feedforward Neural Networks

**The purpose of this session is to demonstrate how to use TensorFlow to develop machine learning algorithms and deep neural network models. In this practical session, we present the following topics:**

1. How to implement classification algorithms using TensorFlow
2. How standard and advanced gradient-based optimization methods work, and how to use the predefined optimizers in TensorFlow
3. How to build a deep neural network for image classification problems using TensorFlow

**References and additional reading and resources**
- [An Introduction to Implementing Neural Networks using TensorFlow](https://www.analyticsvidhya.com/blog/2016/10/an-introduction-to-implementing-neural-networks-using-tensorflow/)
- [Tensorflow Tutorials](https://www.tensorflow.org/tutorials/)
- [Deep Learning Book](http://www.deeplearningbook.org/)
- [37 Reasons why your Neural Network is not working](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607)

---





## <span style="color:#0b486b">1. Classification with TensorFlow</span>

For a classification problem, we wish to predict a discrete label $y$ given features $\mathbf{x}=\left[x_{1},x_{2},...,x_{N}\right]^{T}$, where $y\in\left\{0,1\right\}$ for binary classification and $y\in\left\{ 1,...,K\right\}$ for multiclass classification. We typically solve this problem by learning a function that predicts an unnormalized log-probability (a *logit*) for each class, and then applying the principle of maximum likelihood to derive the loss function. Let's first consider the binary case.

### <span style="color:#0b486b">1.1. Binary classification</span>
#### Problem formulation
For the binary case, we need the posterior probability of the positive class, which follows from the class-conditional densities and the class priors by Bayes' rule:

$$
p(y=1 \mid \mathbf{x})	=	\frac{p(\mathbf{x} \mid y=1)p(y=1)}{p(\mathbf{x} \mid y=0)p(y=0)+p(\mathbf{x} \mid y=1)p(y=1)}
	=	\frac{1}{1+\frac{p(\mathbf{x} \mid y=0)p(y=0)}{p(\mathbf{x} \mid y=1)p(y=1)}}
	=	\frac{1}{1+\exp(-a)}
	=	\sigma(a)
$$

where $a=\log\frac{p(\mathbf{x} \mid y=1)p(y=1)}{p(\mathbf{x} \mid y=0)p(y=0)}$ is the log-odds (the *logit*), and $\sigma(a)$ is the **logistic sigmoid function** defined by:

$$
\sigma(a)=\frac{1}{1+\exp(-a)}
$$ 

If we use a function $f$ with parameters $\mathbf{\theta}$, which can be a simple linear model or a neural network, to estimate $a$, we have:

$$
p(y=1 \mid \mathbf{x})=\sigma(a)=\sigma(f_{\mathbf{\theta}}(\mathbf{x}))
$$ 

For a set of $M$ training examples with binary labels $\left\{(\mathbf{x}^{(i)},y^{(i)}):i=1,...,M\right\}$, the likelihood of the labels under our model is:

$$
\mathcal{L}=\prod_{i=1}^{M}p(y=y^{(i)} \mid \mathbf{x}^{(i)})=\prod_{i=1}^{M}\sigma(a_{i})^{y^{(i)}}(1-\sigma(a_{i}))^{(1-y^{(i)})}
$$ 

where $a_{i}=f_{\mathbf{\theta}}(\mathbf{x}^{(i)})$.

#### <span style="color:#0b486b">Loss function:</span>

We need to find $\mathbf{\theta}$ that maximizes the likelihood $\mathcal{L}$. This can be done more easily by minimizing the negative log-likelihood. Therefore, we have the cost function:

$$
J(\mathbf{\theta})=-\sum_{i=1}^{M}\left[y^{(i)}\log\sigma(a_{i})+(1-y^{(i)})\log(1-\sigma(a_{i}))\right]
$$
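To make the negative log-likelihood concrete, here is a minimal sketch in plain Python (standard library only, no TensorFlow). The function names `sigmoid` and `binary_nll` and the logit values are illustrative assumptions, not part of the original material.

```python
import math

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def binary_nll(logits, labels):
    """Negative log-likelihood: -sum_i [y*log(sigma(a)) + (1-y)*log(1-sigma(a))]."""
    total = 0.0
    for a, y in zip(logits, labels):
        p = sigmoid(a)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

logits = [10.0, -2.0]   # illustrative values of a_i
labels = [1, 0]         # matching binary labels y_i
loss = binary_nll(logits, labels)
print(loss)
```

Both examples are classified with high confidence on the correct side, so the loss is small; flipping either label would make it much larger.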


#### Implementation with TensorFlow

In terms of implementation, given the scalar logit $a$ of each example in logistic classification, we can use the following code to calculate the conditional probability:

In [None]:
import tensorflow as tf
a = tf.constant([10.0, -2.0])
y_proba = tf.nn.sigmoid(a)
with tf.Session() as sess:
    print(y_proba.eval())

Then we can calculate the loss function:

In [None]:
y = tf.constant([1.0, 0.0])
loss = tf.reduce_mean(-y * tf.log(y_proba) - (1 - y) * tf.log(1 - y_proba))
with tf.Session() as sess:
    print(loss.eval())

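However, calculating the probability first is numerically unstable when $a$ is a large positive or negative number: because of the exponential, the sigmoid saturates to exactly 0 or 1 in floating point, and the logarithm that follows produces infinities or `nan`. Try this: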

In [None]:
a = tf.constant([20.0, -5.0])
y_proba = tf.nn.sigmoid(a)
with tf.Session() as sess:
    print(y_proba.eval())
y = tf.constant([1.0, 0.0])
loss = tf.reduce_mean(-y * tf.log(y_proba) - (1 - y) * tf.log(1 - y_proba))
with tf.Session() as sess:
    print(loss.eval())

It is therefore recommended to compute the loss directly from the logits, using the following code:

In [None]:
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=a))
with tf.Session() as sess:
    print(loss.eval())
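Why does working from the logits help? The TensorFlow documentation for `tf.nn.sigmoid_cross_entropy_with_logits` describes the rearrangement `max(a, 0) - a * y + log(1 + exp(-|a|))`, in which the exponential only ever receives a non-positive argument. The sketch below reproduces that formula in plain Python; the function name `stable_sigmoid_xent` is a hypothetical helper, and this is an illustration of the idea rather than TensorFlow's actual implementation.

```python
import math

def stable_sigmoid_xent(a, y):
    """Cross-entropy computed from a logit: max(a, 0) - a*y + log(1 + exp(-|a|))."""
    return max(a, 0.0) - a * y + math.log1p(math.exp(-abs(a)))

logits = [20.0, -5.0]
labels = [1.0, 0.0]
losses = [stable_sigmoid_xent(a, y) for a, y in zip(logits, labels)]
mean_loss = sum(losses) / len(losses)
print(mean_loss)

# Even an extreme logit stays finite: exp(-1000) underflows harmlessly to 0.
extreme = stable_sigmoid_xent(1000.0, 0.0)
print(extreme)
```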

### <span style="color:#0b486b">1.2. Multiclass classification</span>

In a multiclass classification problem with $K>2$ classes, we have:

$$
p(y=k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y=k)p(y=k)}{\sum_{j=1}^{K}p(\mathbf{x} \mid y=j)p(y=j)}
	= \frac{\exp(a_{k})}{\sum_{j=1}^{K}\exp(a_{j})}
	= \sigma(\mathbf{a})_{k}
$$ 

where the $a_k=\log\left(p(\mathbf{x} \mid y=k)p(y=k)\right)$ are called logits, and $\sigma(\mathbf{a})_{k}$ is the softmax function defined by:

$$
\sigma(\mathbf{a})_{k}=\frac{\exp(a_{k})}{\sum_{j=1}^{K}\exp(a_{j})}
$$

As in binary classification, we can use a simple linear model or a neural network with parameters $\mathbf{\theta}$ to model a function $f_{\mathbf{\theta}}(\mathbf{x})=\left[a_{1},a_{2},...,a_{K}\right]^{T}$. For a set of $M$ training examples with multinomial labels $\left\{(\mathbf{x}^{(i)},y^{(i)}):i=1,...,M\right\}$, the likelihood of the labels under our model is:

$$
\mathcal{L}=\prod_{i=1}^{M}p(y=y^{(i)}|\mathbf{x}^{(i)})=\prod_{i=1}^{M}\sigma(\mathbf{a}^{(i)})_{y^{(i)}}
$$

#### <span style="color:#0b486b">Loss function:</span>

We need to find the $\mathbf{\theta}$ that minimizes the negative log-likelihood, so we have the cost function:

$$
J(\mathbf{\theta})=-\sum_{i=1}^{M}\log\sigma(\mathbf{a}^{(i)})_{y^{(i)}}
$$
If we express the label for the $i$-th example as a **one-hot** vector $\mathbf{y}^{(i)}=\left[y_{1}^{(i)},y_{2}^{(i)},...,y_{K}^{(i)}\right]^{T}$, where $y_{k}^{(i)}=1$ if the example belongs to class $k$, and $y_{k}^{(i)}=0$ otherwise, the cost function can be rewritten as:   

$$J(\mathbf{\theta})=-\sum_{i=1}^{M}\sum_{j=1}^{K}y_{j}^{(i)}\log\sigma(\mathbf{a}^{(i)})_{j}$$

We now can use the following code to calculate the conditional probability:

In [None]:
import tensorflow as tf
a = tf.constant([[10.0, -2.0], [-5.0, 3.0]])
y_proba = tf.nn.softmax(a)
with tf.Session() as sess:
    print(y_proba.eval())

Then we can calculate the loss function:

In [None]:
y = tf.constant([[1.0, 0.0], [0.0, 1.0]])
loss = tf.reduce_mean(-tf.reduce_sum(y * tf.log(y_proba), axis=1))
with tf.Session() as sess:
    print(loss.eval())

However, as in binary classification, calculating the probability first is numerically unstable when the entries of $\mathbf{a}$ are large in magnitude, because the exponential makes some softmax outputs underflow to exactly 0. Try this:

In [None]:
# Logits this far apart make softmax underflow to exactly 0 for the second class
a = tf.constant([[100.0, -100.0], [-5.0, 3.0]])
y_proba = tf.nn.softmax(a)
with tf.Session() as sess:
    print(y_proba.eval())
y = tf.constant([[1.0, 0.0], [0.0, 1.0]])
# Same multiclass loss as before: 0 * log(0) evaluates to nan
loss = tf.reduce_mean(-tf.reduce_sum(y * tf.log(y_proba), axis=1))
with tf.Session() as sess:
    print(loss.eval())

It is therefore recommended to compute the loss directly from the logits, using the following code:

In [None]:
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=a)) 
with tf.Session() as sess:
    print(loss.eval())
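A common way to compute the softmax cross-entropy stably from logits is the log-sum-exp trick: subtract the largest logit before exponentiating, and stay in the log domain throughout. The plain-Python sketch below illustrates the idea; the helper names `log_softmax` and `softmax_xent` are assumptions made for this example, and it is not meant to mirror TensorFlow's internal code.

```python
import math

def log_softmax(logits):
    """log softmax(a)_k = a_k - (max + log(sum_j exp(a_j - max)))."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(a - m) for a in logits))
    return [a - log_sum for a in logits]

def softmax_xent(logits, one_hot):
    """Cross-entropy -sum_k y_k * log softmax(a)_k for one example."""
    return -sum(y * lp for y, lp in zip(one_hot, log_softmax(logits)))

batch_logits = [[20.0, -5.0], [-5.0, 3.0]]   # illustrative logits
batch_labels = [[1.0, 0.0], [0.0, 1.0]]      # one-hot labels
losses = [softmax_xent(a, y) for a, y in zip(batch_logits, batch_labels)]
mean_loss = sum(losses) / len(losses)
print(mean_loss)

# Extreme logits stay finite because no probability is ever passed to log().
extreme = softmax_xent([1000.0, -1000.0], [0.0, 1.0])
print(extreme)
```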

When the labels are given as a one-dimensional vector of integer class indices rather than one-hot vectors, we use the following code:

In [None]:
a = tf.constant([[10.0, -2.0], [-5.0, 3.0]])
y = tf.constant([1, 0])
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=a))
with tf.Session() as sess:
    print(loss.eval())

### <span style="color:#0b486b">1.3. Gradient learning</span>

Once you define the loss function, you can minimize it using the gradient descent algorithm or one of its variants. The idea is that, starting at a point in the parameter space, moving in the direction of the gradient estimated at that point increases the loss. To reduce the loss, you should move in the opposite direction. The basic gradient descent algorithm therefore updates the parameters iteratively until it reaches a minimum, which may be local rather than global. For parameters $\mathbf{\theta}$ with gradients $\frac{\partial{J}}{\partial{\mathbf{\theta}}}$, the update takes the form:

$$
\mathbf{\theta}=\mathbf{\theta}-\eta\frac{\partial{J}}{\partial{\mathbf{\theta}}}
$$

where $\eta > 0$ is the learning rate. The learning rate $\eta$ is an important hyperparameter to tune. A low learning rate causes the model to learn slowly. A high learning rate can help the model learn faster, but training may fail to converge because the parameters bounce around the convergence point in the parameter space.

<img src="https://raw.githubusercontent.com/tuliplab/mds/master/Jupyter/image/dl/example02/high-learning-rate.png" width=300>

A simple way to select the learning rate is to try several values, typically between `1e-5` and `1`, and plot the loss after each iteration or epoch. The following image sketches some scenarios of good and bad learning rates:

<img src="https://raw.githubusercontent.com/tuliplab/mds/master/Jupyter/image/dl/example02/learning-rate.jpg" width=300>
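The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on the hypothetical one-dimensional loss $J(\theta)=\theta^{2}$ (chosen here only because its gradient, $2\theta$, is trivial) with four learning rates; the function name and the specific rates are illustrative assumptions.

```python
def gradient_descent(eta, theta0=1.0, steps=20):
    """Minimize J(theta) = theta**2, whose gradient is 2 * theta."""
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * theta
        theta = theta - eta * grad   # theta <- theta - eta * dJ/dtheta
    return theta

for eta in (0.01, 0.1, 0.9, 1.1):
    print(eta, gradient_descent(eta))
```

With this loss each step multiplies $\theta$ by $(1-2\eta)$: a very small rate barely moves, a moderate rate converges quickly, a rate of 0.9 converges while flipping sign at every step, and a rate above 1 diverges.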

There are variants of gradient descent that can reach convergence faster but they are all based on gradients. In what follows, we'll briefly talk about some of them.

#### <span style="color:#0b486b">1.3.1. Stochastic Gradient Descent (SGD)</span>

For gradient descent, we need to empirically estimate the average loss using all training data:
$$
J(\mathbf{\theta})=\frac{1}{M}\sum_{i=1}^{M}\mathcal{L}(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{\theta})
$$

where $\mathcal{L}$ is the per-example loss $\mathcal{L}(\mathbf{x}, y, \mathbf{\theta})=-\log{p(y|\mathbf{x};\mathbf{\theta})}$.
Unfortunately, computing the average loss over a large training set is computationally expensive. Stochastic gradient descent (**SGD**) is an extension of the gradient descent algorithm that avoids this cost. At each step of the training algorithm, we sample a minibatch of examples $\mathbb{B}=\{\mathbf{x}^{(1)},\mathbf{x}^{(2)},..,\mathbf{x}^{(m^{\prime})}\}$. The minibatch size $m^{\prime}$ is typically chosen to be a relatively small number of examples, ranging from 1 to a few hundred. The estimate of the gradient is then:
$$
\mathbf{g}=\frac{1}{m^{\prime}}\nabla_{\mathbf{\theta}}\sum_{i=1}^{m^{\prime}}\mathcal{L}(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{\theta})
$$

The weight update in stochastic gradient descent will be:

$$
\mathbf{\theta}=\mathbf{\theta}-\eta\mathbf{g}
$$

Optimization algorithms that use only a single example at a time are called stochastic methods, while those that use more than one example (but fewer than the whole training set) at a time are traditionally called minibatch stochastic methods. It is now common to simply call both of them stochastic methods.
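The minibatch update can be sketched in a few lines of plain Python. To keep the example short, it uses a hypothetical one-parameter model $\hat{y}=wx$ with the squared-error loss $\frac{1}{2}(wx-y)^{2}$ instead of a negative log-likelihood; the toy data, batch size, and learning rate are all assumptions made for illustration.

```python
import random

random.seed(0)
# Toy data generated from y = 3 * x (hypothetical, noise-free).
data = [(i / 100.0, 3.0 * i / 100.0) for i in range(100)]

def minibatch_gradient(w, batch):
    """Average gradient of the per-example loss 0.5 * (w*x - y)**2 over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

w, eta, batch_size = 0.0, 0.5, 10
for epoch in range(50):
    random.shuffle(data)                       # new minibatches every epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        w -= eta * minibatch_gradient(w, batch)  # theta <- theta - eta * g
print(w)
```

Each update looks at only 10 of the 100 examples, yet the parameter still approaches the value 3 that generated the data.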

<img src="https://raw.githubusercontent.com/tuliplab/mds/master/Jupyter/image/note.gif" width="20" align="left"></img> **Some notes**:
- Using a small batch size can lead to unstable estimates. The standard error of the mean estimated from $m^{\prime}$ samples is $\sigma\mathbin{/}\sqrt{m^{\prime}}$, so training with a very small batch size might require a small learning rate to maintain stability, because of the high variance in the estimate of the gradient.
- Large batches provide a more accurate estimate of the gradient, but with less than linear returns: increasing the batch size from 100 to 10,000 requires 100 times more computation but reduces the standard error of the mean only by a factor of 10.
- Small batches can offer a regularizing effect, perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. However, the total runtime can be very high because more steps are needed, both because of the reduced learning rate and because it takes more steps to observe the entire training set.
- Multicore architectures are usually underutilized by extremely small batches.
- Some kinds of hardware achieve better runtime with specific array sizes. Especially when using GPUs, it is common for power-of-2 batch sizes to offer better runtime. Typical power-of-2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
- The simplest way to feed a minibatch to our computational graph in TensorFlow is to use placeholder nodes. They are typically used to pass the training data to TensorFlow during training.

**Reference: Chapter 8 - Adaptive Computation and Machine Learning, Deep Learning Textbook**

#### <span style="color:#0b486b">1.3.2. Momentum</span>

Stochastic gradient descent can sometimes be slow, especially in the face of high curvature, small but consistent gradients, or noisy gradients. Imagine the loss as the height of a canyon with steep sides. Randomly initializing the parameters is like starting at some location in the canyon, and optimizing the loss is like going down the canyon. The black path in the picture below depicts how SGD wastes time moving back and forth across the narrow axis of the canyon:

<figure>
  <img src="https://raw.githubusercontent.com/tuliplab/mds/master/Jupyter/image/dl/example02/momentum.PNG" width=300>
  <figcaption text-align="center">(*The picture is adapted from Chapter 8, Deep Learning Textbook*)</figcaption>
</figure>

The momentum algorithm, proposed by Boris Polyak in 1964, accelerates learning in such situations by accumulating an exponentially decaying moving average of past gradients and continuing to move in their direction. Momentum introduces a velocity vector $\mathbf{v}$, which gives the direction and speed at which the parameters move through the parameter space. The velocity is set to an exponentially decaying average of the negative gradient. We can interpret the decay as the result of friction that keeps the velocity from growing too large. The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space according to Newton’s laws of motion. Momentum in physics is mass times velocity. In the momentum learning algorithm we assume unit mass, so the velocity vector $\mathbf{v}$ may also be regarded as the momentum $\mathbf{m}$ of the particle.

At each iteration, the local gradient (multiplied by the learning rate $\eta$) is added to the momentum vector $\mathbf{m}$, and the weights are updated by simply subtracting this momentum vector:

$$
\begin{align}
\mathbf{m} &= \beta\mathbf{m} + \eta\nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \\
\mathbf{\theta} &= \mathbf{\theta} - \mathbf{m}
\end{align}
$$

where $\beta$ is the momentum hyperparameter. The higher $\beta$ is, the bigger the influence of past gradients. The momentum hyperparameter usually starts at a small value, such as 0.5, and is increased gradually to 0.9 or 0.99.

When the local gradient keeps pointing in a certain direction, momentum will pick up in that direction, allowing the update path to traverse the canyon lengthwise and converge faster, as shown by the red path in the picture above.
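The two-line momentum update can be tried on a small canyon-shaped loss. The sketch below assumes the hypothetical quadratic $J(\theta)=\frac{1}{2}(\theta_{1}^{2}+50\,\theta_{2}^{2})$, which is gentle along the first axis and steep along the second, and compares plain gradient descent ($\beta=0$) with momentum ($\beta=0.9$). The loss, starting point, learning rate, and step count are illustrative choices.

```python
def grad(theta):
    """Gradient of J(theta) = 0.5 * (theta[0]**2 + 50 * theta[1]**2)."""
    return [theta[0], 50.0 * theta[1]]

def run(beta, eta=0.01, steps=100):
    theta = [1.0, 1.0]
    m = [0.0, 0.0]                      # momentum vector, starts at zero
    for _ in range(steps):
        g = grad(theta)
        m = [beta * mi + eta * gi for mi, gi in zip(m, g)]   # m <- beta*m + eta*grad
        theta = [ti - mi for ti, mi in zip(theta, m)]        # theta <- theta - m
    return theta

def dist(theta):
    """Euclidean distance from the minimum at the origin."""
    return (theta[0] ** 2 + theta[1] ** 2) ** 0.5

plain = run(beta=0.0)       # beta = 0 recovers plain gradient descent
momentum = run(beta=0.9)
print(dist(plain), dist(momentum))
```

After the same number of steps, the momentum run ends much closer to the minimum, because the small but consistent gradient along the gentle axis keeps accumulating in $\mathbf{m}$.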

#### <span style="color:#0b486b">1.3.3. AdaGrad</span>

Consider an elongated bowl with a gentle slope along its valley: gradient descent starts by quickly going down the steepest slope, then slowly goes down the bottom of the valley. We may wish for greater progress in the more gently sloped directions of the parameter space.

<img src='https://raw.githubusercontent.com/tuliplab/mds/master/Jupyter/image/dl/example02/adagrad.PNG' width=500>

(*The picture is adapted from chapter 11, Hands-On Machine Learning with Scikit-Learn and Tensorflow*)

AdaGrad achieves this goal by individually adapting the learning rates of all model parameters, scaling them inversely proportional to the square root of the sum of all the historical squared values of the gradient. The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate.

Each update in the AdaGrad algorithm takes two steps. The first step accumulates the element-wise square of the gradient into the vector $\mathbf{s}$. The second step updates the parameters as usual, with one big difference: the gradient is divided element-wise by $\sqrt{\mathbf{s}+\epsilon}$, where $\epsilon$ is a small number, typically 1e-8, that avoids division by zero.

$$
\begin{align}
\mathbf{s} &= \mathbf{s}+\nabla_{\mathbf{\theta}}J(\mathbf{\theta})\otimes\nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \\
\mathbf{\theta} &= \mathbf{\theta}-\eta\nabla_{\theta}J(\mathbf{\theta})\oslash\sqrt{\mathbf{s}+\epsilon}
\end{align}
$$

where $\otimes$ is element-wise product and $\oslash$ is element-wise division.
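The two AdaGrad steps translate directly into code. The following plain-Python sketch applies them to a hypothetical elongated bowl $J(\theta)=\frac{1}{2}(\theta_{1}^{2}+50\,\theta_{2}^{2})$; the helper names, the loss, and the learning rate of 0.1 are assumptions made for illustration.

```python
import math

def adagrad_step(theta, s, g, eta=0.1, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, then scale each step."""
    s = [si + gi * gi for si, gi in zip(s, g)]
    theta = [ti - eta * gi / math.sqrt(si + eps) for ti, gi, si in zip(theta, g, s)]
    return theta, s

def bowl_gradient(theta):
    """Gradient of J(theta) = 0.5 * (theta[0]**2 + 50 * theta[1]**2)."""
    return [theta[0], 50.0 * theta[1]]

theta, s = [1.0, 1.0], [0.0, 0.0]
theta, s = adagrad_step(theta, s, bowl_gradient(theta))
first_step = list(theta)
for _ in range(99):
    theta, s = adagrad_step(theta, s, bowl_gradient(theta))
print(first_step, theta)
```

Although the gradient is 50 times larger along the second axis, the per-parameter scaling makes both coordinates move by roughly the same amount, which is the "greater progress in the gentle direction" that AdaGrad is after.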

#### <span style="color:#0b486b">1.3.4. RMSProp</span>

Empirically, it has been found that, for training deep neural network models, the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate. AdaGrad performs well for some but not all deep learning models. AdaGrad is designed to converge rapidly when applied to a convex function. When applied to a non-convex function to train a neural network, the learning trajectory may pass through many different structures and eventually arrive at a region that is a locally convex bowl. AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure.

The RMSProp algorithm modifies AdaGrad by changing the gradient accumulation in the first step into an exponentially weighted moving average:

$$
\begin{align}
\mathbf{s} &= \beta\mathbf{s}+(1-\beta)\nabla_{\mathbf{\theta}}J(\mathbf{\theta})\otimes\nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \\
\mathbf{\theta} &= \mathbf{\theta}-\eta\nabla_{\theta}J(\mathbf{\theta})\oslash\sqrt{\mathbf{s}+\epsilon}
\end{align}
$$

Using exponentially decaying average allows RMSProp to discard history from the extreme past so that it can converge rapidly after finding a convex bowl.
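The difference between the two accumulators is easy to see numerically. The sketch below feeds the same constant scalar gradient to both rules and reports the resulting effective step size $\eta/\sqrt{s+\epsilon}$. The constant gradient, $\eta=0.01$, and the decay rate $\beta=0.9$ are illustrative assumptions, and the helper names are hypothetical.

```python
import math

def effective_step(accumulate, grads, eta=0.01, eps=1e-8):
    """Return eta / sqrt(s + eps) after feeding a sequence of scalar gradients."""
    s = 0.0
    for g in grads:
        s = accumulate(s, g)
    return eta / math.sqrt(s + eps)

def adagrad_accumulate(s, g):
    """AdaGrad: running sum of squared gradients."""
    return s + g * g

def rmsprop_accumulate(s, g, beta=0.9):
    """RMSProp: exponentially weighted moving average of squared gradients."""
    return beta * s + (1.0 - beta) * g * g

grads = [1.0] * 1000
ada_step = effective_step(adagrad_accumulate, grads)
rms_step = effective_step(rmsprop_accumulate, grads)
print(ada_step, rms_step)
```

After 1000 identical gradients, AdaGrad's sum has grown to 1000 and its step has shrunk by a factor of about $\sqrt{1000}$, whereas the RMSProp average has settled near 1 and its step stays close to $\eta$.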

#### <span style="color:#0b486b">1.3.5. Adam</span>

**Adam** combines the ideas of Momentum optimization and RMSProp: just like Momentum optimization it keeps track of an exponentially decaying average of past gradients, and just like RMSProp it keeps track of an exponentially decaying average of past squared gradients.

There are five steps in Adam:

$$
\begin{align}
\mathbf{m} &= \beta_{1}\mathbf{m}+(1-\beta_{1})\nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \\
\mathbf{s} &= \beta_{2}\mathbf{s}+(1-\beta_{2})\nabla_{\mathbf{\theta}}J(\mathbf{\theta})\otimes\nabla_{\mathbf{\theta}}J(\mathbf{\theta}) \\
\hat{\mathbf{m}} &= \frac{\mathbf{m}}{1-\beta_{1}^{t}} \\
\hat{\mathbf{s}} &= \frac{\mathbf{s}}{1-\beta_{2}^{t}} \\
\mathbf{\theta} &= \mathbf{\theta}-\eta\hat{\mathbf{m}}\oslash\sqrt{\hat{\mathbf{s}}+\epsilon}
\end{align}
$$

where $t$ is the iteration number, starting at 1, and $\hat{\mathbf{m}}$ and $\hat{\mathbf{s}}$ are bias-corrected copies used only for the current update; the running averages $\mathbf{m}$ and $\mathbf{s}$ themselves are carried over to the next iteration uncorrected.

Steps 1 and 2 estimate the first and second moments of the gradient. Steps 3 and 4 are somewhat of a technical detail: since $\mathbf{m}$ and $\mathbf{s}$ are initialized at 0, they are biased toward 0 at the beginning of training, so these two steps boost them at the beginning of training.

The momentum decay hyperparameter $\beta_{1}$ is typically initialized to 0.9, while the scaling decay hyperparameter $\beta_{2}$ is often initialized to 0.999. As earlier, the smoothing term $\epsilon$ is usually initialized to a tiny number such as $10^{-8}$.
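The five Adam steps can be sketched for a scalar parameter as follows. This plain-Python version keeps the bias-corrected values in separate variables (`m_hat`, `s_hat`), an implementation choice assumed here so that the running averages are not overwritten between iterations. The function name and the toy gradient functions are hypothetical, and this is not TensorFlow's implementation.

```python
import math

def adam(grad_fn, theta, steps, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates on a scalar parameter and return the result."""
    m, s = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g          # step 1: average of gradients
        s = beta2 * s + (1 - beta2) * g * g      # step 2: average of squared gradients
        m_hat = m / (1 - beta1 ** t)             # step 3: bias correction
        s_hat = s / (1 - beta2 ** t)             # step 4: bias correction
        theta = theta - eta * m_hat / math.sqrt(s_hat + eps)   # step 5: update
    return theta

# Two losses whose gradients differ by a factor of 10,000 at theta = 1.
small = adam(lambda th: 0.02 * th, theta=1.0, steps=1)
large = adam(lambda th: 200.0 * th, theta=1.0, steps=1)
print(small, large)
```

Thanks to the bias correction and the division by $\sqrt{\hat{s}+\epsilon}$, the very first update has a size of about $\eta$ in both cases, regardless of the scale of the gradient.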

#### <span style="color:#0b486b">1.3.6. Implementation in TensorFlow</span>

To implement gradient descent-based optimization, you need to calculate the gradients of the loss. You can derive the gradients from the cost function manually. In the case of linear regression this is reasonably easy, but doing it for deep neural networks would be tedious and error-prone. You can instead use TensorFlow’s autodiff feature to compute the gradients automatically, or use one of TensorFlow’s out-of-the-box optimizers.

<img src="https://raw.githubusercontent.com/tuliplab/mds/master/Jupyter/image/note.gif" width="60" align="left"></img> When using Gradient Descent, remember that it is important to first normalize the input feature vectors, or else training may be much slower. You can do this using TensorFlow, NumPy, Scikit-Learn’s StandardScaler, or any other solution you prefer. The following code assumes that this normalization has already been done.

TensorFlow also provides a number of **off-the-shelf optimizers**, including a Gradient Descent optimizer. You can simply declare an optimizer and add an *`op`* that performs an update step (here `mse` stands for whatever loss tensor you have defined):

    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    training_op = optimizer.minimize(mse)

The current list of optimizers implemented in TensorFlow is:
* tf.train.GradientDescentOptimizer
* tf.train.AdadeltaOptimizer
* tf.train.AdagradOptimizer
* tf.train.AdagradDAOptimizer
* tf.train.MomentumOptimizer
* tf.train.AdamOptimizer
* tf.train.FtrlOptimizer
* tf.train.ProximalGradientDescentOptimizer
* tf.train.ProximalAdagradOptimizer
* tf.train.RMSPropOptimizer

... and more are coming. ***Reference***: [https://www.tensorflow.org/api_guides/python/train](https://www.tensorflow.org/api_guides/python/train)

If you want to try the momentum optimizer or the Adam optimizer, replace the optimizer declaration with one of the following lines:

In [None]:
learning_rate = 0.1
# Keep only ONE of the two declarations; if both run, the second overrides the first.
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=0.9)

### <span style="color:#0b486b">1.4. Putting it all together for multiclass classification</span>

We now have all the ingredients to build a logistic or softmax classifier. Let's build a simple softmax classifier on the MNIST dataset.

* #### <span style="color:#0b486b">Step 1: Load or download the dataset</span>

In [None]:
!ls
#! mkdir dl

In [None]:
from tensorflow.examples.tutorials.mnist import input_data

# https://github.com/tuliplab/mds/tree/master/Jupyter/data/dl
# You may need to download all data to a local folder

mnist = input_data.read_data_sets("dl/")
X_train = mnist.train.images
X_test = mnist.test.images
y_train = mnist.train.labels.astype("int")
y_test = mnist.test.labels.astype("int")

* #### <span style="color:#0b486b">Step 2: Build the graph using TensorFlow</span>


In [None]:
import tensorflow as tf
tf.reset_default_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=[None, n_inputs], name='X')
y = tf.placeholder(tf.int64, shape=[None], name='y')

W = tf.Variable(tf.truncated_normal([n_inputs, n_outputs], stddev=0.02), name='weights')
b = tf.Variable(tf.zeros([n_outputs]), name='biases')

logits = tf.add(tf.matmul(X, W), b, name='logits')

with tf.name_scope('evaluation'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits, name='xentropy')
    loss = tf.reduce_mean(xentropy, name='loss')
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

with tf.name_scope("train"):
    grad_W, grad_b = tf.gradients(loss, [W, b])
    update_W = tf.assign(W, W - learning_rate * grad_W)
    update_b = tf.assign(b, b - learning_rate * grad_b)

init = tf.global_variables_initializer()

* #### <span style="color:#0b486b">Step 3: Train the model</span>


In [None]:
n_epochs = 20
batch_size = 50

with tf.Session() as sess:
    init.run()
    print("Epoch\tTrain accuracy\tTest accuracy")
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run([update_W, update_b], feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print('%d\t%f\t%f' % (epoch, acc_train, acc_test))

We end up with a test accuracy of about 92%, which is not a bad result. Now let's try TensorFlow's built-in gradient descent optimizer. We will change the code in **Step 2** and denote the variants **Step 2(a)**, **Step 2(b)**, and so on. The code for **Step 3** stays the same except that it runs the optimizer's training op; we call it **Step 3 (replicate)**.

* #### <span style="color:#0b486b">Step 2(a): Build the graph using TensorFlow gradient descent optimizer</span>

In [None]:
import tensorflow as tf
tf.reset_default_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=[None, n_inputs], name='X')
y = tf.placeholder(tf.int64, shape=[None], name='y')

W = tf.Variable(tf.truncated_normal([n_inputs, n_outputs], stddev=0.02), name='weights')
b = tf.Variable(tf.zeros([n_outputs]), name='biases')

logits = tf.add(tf.matmul(X, W), b, name='logits')

with tf.name_scope('evaluation'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits, name='xentropy')
    loss = tf.reduce_mean(xentropy, name='loss')
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()

* #### <span style="color:#0b486b">Step 3 (replicate): Train the model</span>

In [None]:
n_epochs = 20
batch_size = 50

with tf.Session() as sess:
    init.run()
    print("Epoch\tTrain accuracy\tTest accuracy")
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print('%d\t%f\t%f' % (epoch, acc_train, acc_test))

We end up with the same result as with the manual update built on `tf.gradients`, which is expected. Now let's try the **Adam optimizer**.

* #### <span style="color:#0b486b">Step 2(b): Build the graph using TensorFlow Adam optimizer</span>

In [None]:
import tensorflow as tf
tf.reset_default_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
learning_rate = 0.001

X = tf.placeholder(tf.float32, shape=[None, n_inputs], name='X')
y = tf.placeholder(tf.int64, shape=[None], name='y')

W = tf.Variable(tf.truncated_normal([n_inputs, n_outputs], stddev=0.02), name='weights')
b = tf.Variable(tf.zeros([n_outputs]), name='biases')

logits = tf.add(tf.matmul(X, W), b, name='logits')

with tf.name_scope('evaluation'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits, name='xentropy')
    loss = tf.reduce_mean(xentropy, name='loss')
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

with tf.name_scope("train"):
    optimizer = tf.train.AdamOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()

* #### <span style="color:#0b486b">Step 3 (replicate): Train the model</span>

In [None]:
n_epochs = 20
batch_size = 50

with tf.Session() as sess:
    init.run()
    print("Epoch\tTrain accuracy\tTest accuracy")
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print('%d\t%f\t%f' % (epoch, acc_train, acc_test))

We have improved the accuracy by about 1%, a modest but promising gain.

Next, we are going to use deeper and more powerful models in the following section.

### 1.5. Exercises

Train the softmax classifier from Section 1.4 using some of the other optimizers provided by TensorFlow: [Adadelta](https://www.tensorflow.org/api_docs/python/tf/train/AdadeltaOptimizer) and [RMSProp](https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer). Try different hyperparameter settings for these optimizers and report the best values (of the hyperparameters and the corresponding performance).

## <span style="color:#0b486b">2. Deep Feedforward Neural Networks</span>

A deep feedforward neural network (DNN) is essentially an advanced version of the multilayer perceptron (MLP) with nonlinear hidden activations. DNNs are at the very core of *Deep Learning*. They are versatile, powerful, and scalable, making them ideal for tackling large and highly complex Machine Learning tasks, such as classifying billions of images (e.g., Google Images), powering speech recognition services (e.g., Apple’s Siri), recommending the best videos to watch to hundreds of millions of users every day (e.g., YouTube), or learning to beat the world champion at the game of Go by examining millions of past games and then playing against itself (DeepMind’s AlphaGo).<br>

In this session, we are going to use TensorFlow's Python API to solve the MNIST digit classification problem. We will use minibatch gradient descent to train our network. Generally, the first step is the construction phase, i.e., building the TensorFlow graph, and the second step is the execution phase, where you actually run the graph to train the model.




Now let's first import the TensorFlow library and load the MNIST dataset.

In [None]:
import tensorflow as tf
import numpy as np

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("dl/")
x_train = mnist.train.images
x_test = mnist.test.images
y_train = mnist.train.labels.astype("int")
y_test = mnist.test.labels.astype("int")

When developing an application using a DNN, we usually go through two main stages, followed by a third topic that addresses overfitting:

1. Construction phase
2. Execution phase (training and testing)
3. Regularization

DNNs overfit easily, which leads to poor performance on test datasets. Regularization techniques are often used to prevent this, so we also introduce some regularization methods that are popular for DNNs.


### <span style="color:#0b486b">2.1. Construction phase</span>


Let’s start. First, we need to specify the number of inputs and outputs, and set the number of hidden neurons in each layer:

In [None]:
num_inputs = 28 * 28 # this is the size of images in pixels 
num_hidden1 = 300
num_hidden2 = 100
num_outputs = 10  # this is the number of classes (label)

Next we use placeholder nodes to represent the training data and labels. The shapes of *`x`* and *`y`* are only partially defined to be able to take an arbitrary minibatch size.

In [None]:
x = tf.placeholder(tf.float32, shape=[None, num_inputs], name="x")
y = tf.placeholder(tf.int32, shape=[None], name="y")

Now we define a function, i.e., a layer creator, that helps create one layer of our multilayer network.


In [None]:
def neuron_layer(x, num_neurons, name, activation=None):
    with tf.name_scope(name):
        num_inputs = int(x.get_shape()[1])
        stddev = 2 / np.sqrt(num_inputs)
        init = tf.truncated_normal([num_inputs, num_neurons], stddev=stddev)
        W = tf.Variable(init, name="weights")
        b = tf.Variable(tf.zeros([num_neurons]), name="biases")
        z = tf.matmul(x, W) + b
    if activation == "relu":
        return tf.nn.relu(z)
    else:
        return z

Let's go through the above code line by line:

1. First we create a name scope using the name of the layer: it will contain all the computation nodes for this neuron layer. This is optional, but the graph will look much nicer in TensorBoard if its nodes are well organized.
2. Next, we get the number of inputs by looking up the input matrix’s shape and getting the size of the second dimension (the first dimension is for instances).
3. The next three lines create a `W` variable that will hold the weights matrix. It will be a 2D tensor containing all the connection weights between each input and each neuron; hence, its shape will be `(num_inputs, num_neurons)`. It will be initialized randomly, using a [truncated normal (Gaussian) distribution](https://www.tensorflow.org/api_docs/python/tf/truncated_normal) with a standard deviation of $\frac{2}{\sqrt{n_{inputs}}}$, where $n_{inputs}$ is `num_inputs`. Using this specific standard deviation helps the algorithm converge much faster.
4. The next line creates a $\mathbf{b}$ variable for biases, initialized to `0` (no symmetry issue in this case), with one bias parameter per neuron.
5. Then we create a subgraph to compute $\mathbf{Z} = \mathbf{X} \mathbf{W} + \mathbf{b}$, where each row of $\mathbf{X}$ is one instance. This vectorized implementation efficiently computes the weighted sums of the inputs plus the bias term for every neuron in the layer, for all the instances in the batch, in just one shot.
6. Finally, if the activation parameter is set to "relu", the code returns *`relu(z)`* (i.e., *`max(0,z)`*), or else it just returns a linear *`z`*.
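
The steps above can also be traced outside TensorFlow. The following is a minimal NumPy sketch, not the TensorFlow implementation: the names `truncated_normal`, `neuron_layer_np` and the `*_demo` variables are hypothetical, and the truncation is assumed to redraw any sample that falls more than two standard deviations from the mean.

```python
import numpy as np

def truncated_normal(shape, stddev, rng):
    """Draw N(0, stddev^2) samples, redrawing any value beyond 2 standard deviations."""
    samples = rng.normal(0.0, stddev, size=shape)
    out_of_range = np.abs(samples) > 2 * stddev
    while out_of_range.any():
        samples[out_of_range] = rng.normal(0.0, stddev, size=out_of_range.sum())
        out_of_range = np.abs(samples) > 2 * stddev
    return samples

def neuron_layer_np(x, num_neurons, rng, activation=None):
    """NumPy analogue of neuron_layer (illustrative only): returns (output, W, b)."""
    num_inputs = x.shape[1]                       # second dimension = features
    stddev = 2 / np.sqrt(num_inputs)
    W = truncated_normal((num_inputs, num_neurons), stddev, rng)
    b = np.zeros(num_neurons)                     # one bias per neuron
    z = x @ W + b                                 # b is broadcast over the batch
    if activation == "relu":
        return np.maximum(0, z), W, b
    return z, W, b

rng = np.random.default_rng(0)
x_demo = rng.random((5, 28 * 28))                 # hypothetical batch of 5 flattened images
hidden_demo, W_demo, b_demo = neuron_layer_np(x_demo, 300, rng, activation="relu")
print(hidden_demo.shape)                          # (5, 300)
print(W_demo.shape, b_demo.shape)                 # (784, 300) (300,)
```

The output has one row per instance and one column per neuron, whatever the batch size is, which is why the first dimension of the placeholder can stay undefined.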

Now let’s use the *`neuron_layer`* function to create the deep neural network! The first hidden layer takes *`x`* as its input. The second takes the output of the first hidden layer as its input. And finally, the output layer takes the output of the second hidden layer as its input.

In [None]:
with tf.name_scope("dnn"):
    hidden1 = neuron_layer(x, num_hidden1, "hidden1", activation="relu")
    hidden2 = neuron_layer(hidden1, num_hidden2, "hidden2", activation="relu")
    logits = neuron_layer(hidden2, num_outputs, "output")

We need to define a loss function. As discussed in the last lecture, it is more numerically stable to compute the cross entropy loss directly from the logits using the following TensorFlow function:

In [None]:
with tf.name_scope('evaluation'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=y, logits=logits, name='xentropy')
    loss = tf.reduce_mean(xentropy, name="loss")
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

*`tf.nn.sparse_softmax_cross_entropy_with_logits()`* computes the cross entropy based on the “logits” (i.e., the output of the network before going through the softmax activation function), and it expects labels in the form of integers ranging from 0 to the number of classes minus 1 (in our case, from 0 to 9). This will give us a 1D tensor containing the cross entropy for each instance. We can then use TensorFlow’s *`reduce_mean()`* function to compute the mean cross entropy over all instances. 
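
To see why working from the logits helps, here is a minimal NumPy sketch of the same quantity. It is an illustration under assumptions rather than TensorFlow's actual implementation: the function name `sparse_softmax_xentropy` and the demo arrays are hypothetical, and stability is obtained with the usual trick of subtracting the row maximum before exponentiating.

```python
import numpy as np

def sparse_softmax_xentropy(logits, labels):
    """Per-example cross entropy from raw logits and integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)    # largest logit becomes 0
    log_sum_exp = np.log(np.exp(shifted).sum(axis=1))
    log_probs_true = shifted[np.arange(len(labels)), labels] - log_sum_exp
    return -log_probs_true

logits_demo = np.array([[2.0, 1.0, 0.1],
                        [1000.0, 0.0, -1000.0]])  # a naive exp(1000) would overflow
labels_demo = np.array([0, 1])                    # integer labels, not one-hot
xent_demo = sparse_softmax_xentropy(logits_demo, labels_demo)
print(xent_demo)           # one cross entropy value per example
print(xent_demo.mean())    # scalar loss, analogous to reduce_mean
```

Computing the softmax probabilities first and taking their logarithm afterwards would produce infinities or NaNs on the second row, while the shifted form stays finite.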

We also wish to estimate the accuracy of our model. For this you can use the [*`in_top_k()`* function](https://www.tensorflow.org/api_docs/python/tf/nn/in_top_k) with *k=1*. This returns a 1D tensor full of boolean values, so we need to cast these booleans to floats and then compute the average. This will give us the network’s overall accuracy.
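
The accuracy computation can be sketched in NumPy as follows. This is a hedged illustration: `in_top_k_np` and the demo arrays are hypothetical names, and the tie handling (a target counts as correct when fewer than `k` classes score strictly higher) is an assumption about how ties are resolved.

```python
import numpy as np

def in_top_k_np(logits, targets, k=1):
    """True where fewer than k classes score strictly higher than the target class."""
    target_scores = logits[np.arange(len(targets)), targets][:, None]
    num_higher = (logits > target_scores).sum(axis=1)
    return num_higher < k

logits_demo = np.array([[0.1, 2.5, 0.3],    # predicted class 1
                        [1.2, 0.4, 0.9],    # predicted class 0
                        [0.2, 0.1, 3.0],    # predicted class 2
                        [0.5, 0.6, 0.1]])   # predicted class 1
targets_demo = np.array([1, 0, 1, 1])
correct_demo = in_top_k_np(logits_demo, targets_demo, k=1)
accuracy_demo = correct_demo.astype(np.float32).mean()   # cast booleans, then average
print(correct_demo)     # [ True  True False  True]
print(accuracy_demo)    # 0.75
```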

Now we need to define a GradientDescentOptimizer that will tweak the model parameters to minimize the cost function.

In [None]:
with tf.name_scope("train"):
    learning_rate = 0.01
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)
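
Conceptually, each run of the training operation applies the update $\theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta)$ to every parameter, with $\eta$ the learning rate. The plain-Python sketch below applies that rule to a hypothetical one-dimensional loss; the function `gradient_descent` and the loss $J(\theta) = (\theta - 3)^2$ are illustrative assumptions, not part of the model above.

```python
def gradient_descent(grad_fn, theta, learning_rate=0.01, num_steps=1000):
    """Repeatedly apply theta <- theta - learning_rate * gradient."""
    for _ in range(num_steps):
        theta = theta - learning_rate * grad_fn(theta)
    return theta

# Hypothetical loss J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_final = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta_final)    # very close to the minimizer 3.0
```

In TensorFlow the gradients are derived automatically from the graph, so only the loss and the learning rate have to be supplied.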

Finally, we need to create a node to initialize all variables, and we will also create a *`Saver`* to save our trained model parameters to disk:

In [None]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

**Let's review the construction phase a little bit. We have:**
- Created placeholders for the inputs and the targets;
- Created a function to build a neuron layer and used it to create the DNN;
- Defined the cost function and performance measure; and
- Created an optimizer.

Now we move on to the execution phase.

## <span style="color:#0b486b">2.2. Execution phase</span>

We first define the number of epochs that we want to run, as well as the size of the minibatches:

In [None]:
num_epochs = 20
batch_size = 50
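
With these settings, one epoch consists of `num_examples // batch_size` parameter updates, each computed on a different minibatch. The NumPy sketch below shows one way such minibatches could be produced. It is only an approximation of what a dataset helper does internally: `iterate_minibatches` and the stand-in data are hypothetical, and dropping the last partial batch is an assumption made for simplicity.

```python
import numpy as np

def iterate_minibatches(features, labels, batch_size, rng):
    """Yield shuffled (features, labels) minibatches covering one epoch."""
    order = rng.permutation(len(features))
    num_batches = len(features) // batch_size     # a final partial batch is dropped
    for i in range(num_batches):
        idx = order[i * batch_size:(i + 1) * batch_size]
        yield features[idx], labels[idx]

rng = np.random.default_rng(0)
features_demo = rng.random((1050, 4))             # hypothetical stand-in data
labels_demo = rng.integers(0, 10, size=1050)
batches = list(iterate_minibatches(features_demo, labels_demo, batch_size=50, rng=rng))
print(len(batches))             # 21 updates per epoch
print(batches[0][0].shape)      # (50, 4)
```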

We can check the shapes of our training and testing datasets:

In [None]:
print(mnist.train.images.shape)
print(mnist.test.images.shape)

Now we can train our model:

In [None]:
with tf.Session() as sess:
    init.run()
    print("Epoch\tTrain accuracy\tTest accuracy")
    for epoch in range(num_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            x_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={x: x_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={x: mnist.train.images, y: mnist.train.labels})
        acc_test = accuracy.eval(feed_dict={x: mnist.test.images, y: mnist.test.labels})
        print("{}\t{}\t{}".format(epoch, acc_train, acc_test))

    save_path = saver.save(sess, "models/example03/dnn_final.ckpt")

<img src="https://raw.githubusercontent.com/tuliplab/mds/master/Jupyter/image/warning.png" width="40", align="left"></img> If the folder containing this notebook does not contain a `models/example03` folder (the parent directory of the checkpoint path in the last line of code), you will get the error 

ValueError: ``Parent directory of models/example03/dnn_final.ckpt doesn't exist, can't save.``

You **must** create `models/example03` inside the folder containing this notebook to fix the error.
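
Alternatively, the missing directory can be created from Python before saving. The sketch below uses only the standard library; the helper name `ensure_parent_dir` is hypothetical, and the demonstration runs inside a temporary directory so that nothing is left on disk.

```python
import os
import tempfile

def ensure_parent_dir(checkpoint_path):
    """Create the parent directory of checkpoint_path if it is missing."""
    parent = os.path.dirname(checkpoint_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return parent

# Demonstrated inside a temporary directory so nothing is left behind;
# in the notebook you would pass "models/example03/dnn_final.ckpt" directly.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "models", "example03", "dnn_final.ckpt")
    created = ensure_parent_dir(demo_path)
    parent_exists = os.path.isdir(created)
print(parent_exists)    # True
```

Calling such a helper once before the training cell makes the save step independent of how the notebook folder was set up.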


## <span style="color:#0b486b">2.3. Combining two phases using API functions of TensorFlow</span>

As you might expect, TensorFlow comes with many handy functions to create standard neural network layers, so there’s often no need to define your own *`neuron_layer()`* function as we just did. For example, the *`tf.layers.dense()`* function creates a fully connected layer, where all the inputs are connected to all the neurons in the layer. It takes care of creating the weight and bias variables. Its activation argument defaults to *`None`* (a linear output), but we can set it to an activation function such as *`tf.nn.relu`*. Let’s tweak the preceding code to use *`tf.layers.dense()`* instead of our *`neuron_layer()`* function.

* **Building a DNN model using predefined functions in TensorFlow**:

In [None]:
tf.reset_default_graph()

num_inputs = 28 * 28
num_hidden1 = 300
num_hidden2 = 100
num_outputs = 10
learning_rate = 0.01

x = tf.placeholder(tf.float32, shape=(None, num_inputs), name="x")
y = tf.placeholder(tf.int64, shape=(None,), name="y")  # 1D tensor of labels, any batch size

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(x, num_hidden1, name="hidden1", activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, num_hidden2, name="hidden2", activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, num_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
init = tf.global_variables_initializer()
saver = tf.train.Saver()

* **Training and testing (executing) the DNN model defined in the previous step:**

In [None]:
num_epochs = 20
batch_size = 50

with tf.Session() as sess:
    init.run()
    print("Epoch\tTrain accuracy\tTest accuracy")
    for epoch in range(num_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            x_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={x: x_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={x: mnist.train.images, y: mnist.train.labels})
        acc_test = accuracy.eval(feed_dict={x: mnist.test.images, y: mnist.test.labels})        
        print("{}\t{}\t{}".format(epoch, acc_train, acc_test))


### 2.4 Exercises


- Try five different values for the number of hidden nodes in each layer in the construction step (Section 2.1) and report which of your chosen values performs best.
- Increase the number of hidden layers to **three**, try five values for the number of hidden nodes, then report the best value and its performance.
- Try changing the optimizer used to train the model to [Adam](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) and [RMSProp](https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer).