<a href="https://colab.research.google.com/github/yexf308/MAT592/blob/main/24_Neural_Networks2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Part of this notebook is based on "Neural Networks and Deep Learning" and Prof. Guangliang Chen's notes.

In [None]:
%pylab inline 
import numpy.linalg as LA
from IPython.display import Image

Populating the interactive namespace from numpy and matplotlib


## Tip of the day - Progress Bar

When running a long calculation, we would usually want to have a progress bar to track the progress of our process. One great python package for creating such a progress bar is [**tqdm**](https://github.com/tqdm/tqdm). This package is easy to use and offers a highly customizable progress bar. 

For example, to add a progress bar to an existing loop, simply wrap the iterable that the loop runs over with **tqdm.notebook.tqdm**:

In [None]:
from tqdm.notebook import tqdm  # import the notebook submodule explicitly
import time
for i in tqdm(range(10)):
    print('Step {}'.format(i))
    time.sleep(1)

  0%|          | 0/10 [00:00<?, ?it/s]

Step 0
Step 1
Step 2
Step 3
Step 4
Step 5
Step 6
Step 7
Step 8
Step 9


# How do we train a neural network?

In [None]:
display(Image(url='https://github.com/yexf308/MAT592/blob/main/image/DNN_notation.png?raw=true', width=500))

Here we will use the **sigmoid function** $\sigma$ and **logistic loss** in the output layer. 

### Notation
For each $l = 1,\dots, L$:   

- $w_{jk}^{(l)}$: weight of the connection from neuron $k$ in layer $l-1$ to neuron $j$ in layer $l$ (the indices read backward, from $j$ to $k$). 
  - $\mathbf{W}^{(l)}=\left(w_{jk}^{(l)}\right)_{j,k}$, matrix of all weights between layers $l-1$ and $l$. 

- $b_j^{(l)}$: layer $l$, neuron $j$ bias. 
  - $\mathbf{b}^{(l)}=\left(b_j^{(l)}\right)_j$: vector of biases in layer $l$. 

- $a_j^{(l)}$: layer $l$, neuron $j$ output.
   - $\mathbf{a}^{(l)}=\left(a_j^{(l)}\right)_j$: vector of outputs from neurons in layer $l$. 

- $z_j^{(l)}=\sum_k w_{jk}^{(l)} a_k^{(l-1)}+b_j^{(l)}$: weighted input to neuron $j$ in layer $l$.
   - $\mathbf{z}^{(l)}=\left(z_j^{(l)}\right)_j$: vector of weighted inputs to neurons in layer $l$. 

- $a_j^{(l)}=g(z_j^{(l)})$ and $\mathbf{a}^{(l)}=g(\mathbf{z}^{(l)})$ (componentwise), where $g$ is the activation function. 

- $\theta = \{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\}_{l=1}^L$ are parameters. The whole dataset $\mathcal{D}=\{\mathbf{x}^{(i)},y^{(i)}\}_{i=1}^N$.

### Forward propagation
- The input layer is indexed by $l=0$ so that $\mathbf{a}^{(0)}=\mathbf{x}$. The output layer is defined as $\mathbf{a}^{(L)}(\mathbf{x}) = \vec f(\mathbf{x}; \theta)$. 

- For each $1\le l \le L$, 
\begin{align}
&\mathbf{z}^{(l)}=\mathbf{W}^{(l)}\mathbf{a}^{(l-1)}+\mathbf{b}^{(l)} \\
&\mathbf{a}^{(l)}=g(\mathbf{z}^{(l)})
\end{align}

- At each layer, we first apply a linear (affine) transformation and then apply the activation function (here the sigmoid) componentwise. So $\mathbf{a}^{(L)}(\mathbf{x})$ is ...

In [None]:
display(Image(url='https://github.com/yexf308/MAT592/blob/main/image/ANN_forward.png?raw=true', width=500))
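
The forward pass can be sketched in a few lines of NumPy. This is only a minimal illustration: the layer sizes, the random initialization, and the helper names `sigmoid` and `forward` are assumptions, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied componentwise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, g=sigmoid):
    """Return the lists [z^(1), ..., z^(L)] and [a^(0), ..., a^(L)] for one input column x."""
    a = x
    zs, activations = [], [a]
    for W, b in zip(weights, biases):
        z = W @ a + b          # z^(l) = W^(l) a^(l-1) + b^(l)
        a = g(z)               # a^(l) = g(z^(l)), componentwise
        zs.append(z)
        activations.append(a)
    return zs, activations

# Hypothetical layer sizes: 4 inputs, 5 hidden neurons, 3 outputs
rng = np.random.default_rng(0)
sizes = [4, 5, 3]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

x = rng.standard_normal((4, 1))
zs, activations = forward(x, weights, biases)
print(activations[-1].shape)   # shape of the output a^(L)
```

Storing every $\mathbf{z}^{(l)}$ and $\mathbf{a}^{(l)}$ is a deliberate choice here, since backpropagation needs them later.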

### Loss function

To tune the weights and biases of a network of sigmoid neurons, we need to select a loss function.

For simplicity, we first consider the square loss, with per-sample loss $\ell(\vec f(\mathbf{x};\theta),y)=\frac{1}{2}\|\mathbf{a}^{(L)}(\mathbf{x})-\mathbf{y}\|^2$:
$$L(\theta; \mathcal{D}) =\frac{1}{N}\sum_{i=1}^N \ell(\vec f(\mathbf{x}^{(i)};\theta),y^{(i)})=\frac{1}{2N}\sum_{i=1}^N\|\mathbf{a}^{(L)}(\mathbf{x}^{(i)})-\mathbf{y}^{(i)}\|^2 $$

Here $\mathbf{y}^{(i)}$ is the one-hot vector of the training label $y^{(i)}$. For example, in the MNIST dataset, the labels are coded as follows:
$$\text{digit 0} = \begin{bmatrix}1 \\0 \\ \vdots \\0 \end{bmatrix}, \text{digit 1} = \begin{bmatrix}0 \\1 \\ \vdots \\0 \end{bmatrix}, \dots,\text{digit 9} = \begin{bmatrix}0 \\0 \\ \vdots \\1 \end{bmatrix}$$

Therefore, by varying the weights and biases, we try to minimize the difference between each network output $\mathbf{a}^{(L)}(\mathbf{x}^{(i)})$ and one of the vectors above (associated
to the training class that $\mathbf{x}^{(i)}$ belongs to).
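
As a small illustration, the one-hot coding and the square loss $\frac{1}{2N}\sum_i\|\mathbf{a}^{(L)}(\mathbf{x}^{(i)})-\mathbf{y}^{(i)}\|^2$ might be computed as follows. The labels and network outputs below are made up, and the helper names `one_hot` and `square_loss` are hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer labels as one-hot columns, shape (num_classes, N)."""
    Y = np.zeros((num_classes, len(labels)))
    Y[labels, np.arange(len(labels))] = 1.0
    return Y

def square_loss(A, Y):
    """L = 1/(2N) * sum_i ||a^(L)(x_i) - y_i||^2, with one sample per column."""
    N = A.shape[1]
    return np.sum((A - Y) ** 2) / (2 * N)

labels = np.array([0, 1, 9])          # three made-up digit labels
Y = one_hot(labels, num_classes=10)
A = np.full((10, 3), 0.1)             # made-up network outputs, all equal to 0.1

print(Y[:, 2])                        # one-hot vector for digit 9
print(square_loss(A, Y))
```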




## The backpropagation algorithm
The goal here is to compute all the
partial derivatives $\frac{\partial L(\theta; \mathcal{D})}{\partial \mathbf{W}^{(l)}}, \frac{\partial L(\theta; \mathcal{D})}{\partial \mathbf{b}^{(l)}}$, which is far from trivial. 


In [None]:
display(Image(url='https://github.com/yexf308/MAT592/blob/main/image/ANN_output.png?raw=true', width=500))

To simplify the task a bit, we consider the sample error for a single $\mathbf{x}^{(i)}$ only, $C_i\triangleq\ell(\vec f(\mathbf{x}^{(i)};\theta),y^{(i)})=\frac{1}{2}\|\mathbf{a}^{(L)}(\mathbf{x}^{(i)})-\mathbf{y}^{(i)}\|^2$ (the factor $\frac{1}{2}$ matches the $\frac{1}{2N}$ in the total loss).

Therefore, the derivatives of the total loss are averages of the individual per-sample derivatives,
\begin{align}
&\frac{\partial L(\theta; \mathcal{D})}{\partial \mathbf{W}^{(l)}}=\frac{1}{N}\sum_{i=1}^N \frac{\partial C_i}{\partial \mathbf{W}^{(l)}}  \\
&\frac{\partial L(\theta; \mathcal{D})}{\partial \mathbf{b}^{(l)}} =\frac{1}{N}\sum_{i=1}^N \frac{\partial C_i}{\partial \mathbf{b}^{(l)}}
\end{align}

### Output layer first
Let's start with the output layer. 
By the chain rule, we have
\begin{align}
\frac{\partial C_i}{\partial w^{(\color{red}L)}_{jk}} &= \frac{\partial C_i}{\partial a_j^{(\color{red}L)}}\cdot \frac{\partial a_j^{(\color{red}L)}}{\partial z_j^{(\color{red} L)}}\cdot \frac{\partial z_j^{(\color{red}L)}}{\partial w^{(\color{red}L)}_{jk}} \\
&= \left(a_j^{(L)} - y_j^{(i)}\right) \cdot g'(z_j^{(L)})\cdot a_k^{(L-1)}
\end{align}

\begin{align}
\frac{\partial C_i}{\partial b^{(\color{red}L)}_{j}} &= \frac{\partial C_i}{\partial a_j^{(\color{red}L)}}\cdot \frac{\partial a_j^{(\color{red}L)}}{\partial z_j^{(\color{red} L)}}\cdot \frac{\partial z_j^{(\color{red}L)}}{\partial b^{(\color{red}L)}_{j}} \\
&= \left(a_j^{(L)} - y_j^{(i)}\right) \cdot g'(z_j^{(L)})
\end{align}

where $a_j^{(L)}=g(z_j^{(L)})$ and $z_j^{(L)}=\sum_{k'}w^{(L)}_{jk'}a_{k'}^{(L-1)}+b_j^{(L)}$. 

Let us interpret the formula above. 
The rate of change of $C_i=\ell(\vec f(\mathbf{x}^{(i)};\theta),y^{(i)})$ with respect to $w^{(L)}_{jk}$ depends on three factors,

- $\left(a_j^{(L)} - y_j^{(i)}\right)$, how much current
output is off from desired output. 

- $g'(z_j^{(L)})$, how fast the neuron reacts
to changes of its input.

- $a_k^{(L-1)}$, contribution from neuron $k$ in layer $L-1$. 

If the activation function $g$ is the sigmoid function $\sigma$, then even if the current output is far from the desired output, $w^{(L)}_{jk}$ may learn slowly when the input
neuron is in low activation ($a_k^{(L-1)}\approx 0$) or when the output neuron has “saturated”,
i.e., is in either high or low activation, since in both cases $\sigma'(z_j^{(L)})\approx 0$.
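
To see the saturation effect numerically, here is a short sketch using the identity $\sigma'(z)=\sigma(z)(1-\sigma(z))$. The sample values of $z$ are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and is nearly zero once |z| is large
for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"z = {z:6.1f}   sigma(z) = {sigmoid(z):.5f}   sigma'(z) = {sigmoid_prime(z):.5f}")
```

The derivative is largest at $z=0$, where it equals $0.25$, so the factor $g'(z_j^{(L)})$ can only shrink the error signal.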



---


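The beauty of this algorithm is that it can be vectorized. Below, $\circ$ denotes the componentwise (Hadamard) product, $h^{(l)}$ is the number of neurons in layer $l$, and $k$ (in the dimension annotations only) is the number of output neurons. 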
\begin{align}
&\frac{\partial C_i}{\partial \mathbf{W}^{(\color{red}L)}} = \left(\underbrace{\left(\mathbf{a}^{(L)}-\mathbf{y}^{(i)}\right)}_{\in \mathbb{R}^{k\times1}}\circ \underbrace{g'(\mathbf{z}^{(L)})}_{\in \mathbb{R}^{k\times 1}}\right)\cdot \underbrace{(\mathbf{a}^{(L-1)})^\top}_{\in \mathbb{R}^{1\times h^{(L-1)}}} \in \mathbb{R}^{k\times h^{(L-1)}} \\ 
& \frac{\partial C_i}{\partial \mathbf{b}^{(\color{red}L)}} = \underbrace{\left(\mathbf{a}^{(L)}-\mathbf{y}^{(i)}\right)}_{\in \mathbb{R}^{k\times1}}\circ \underbrace{g'(\mathbf{z}^{(L)})}_{\in \mathbb{R}^{k\times 1}} \in \mathbb{R}^{k\times 1}
\end{align}

For convenience, define the auxiliary quantity $\delta^{(L)}$,
\begin{align}
\delta^{(L)} \triangleq \left(\mathbf{a}^{(L)}-\mathbf{y}^{(i)}\right)\circ g'(\mathbf{z}^{(L)})
\end{align}

Then 
\begin{align}
\frac{\partial C_i}{\partial \mathbf{W}^{(L)}} =\delta^{(L)} \cdot (\mathbf{a}^{(L-1)})^\top
\end{align}
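
The vectorized output-layer formulas can be checked numerically. The sketch below assumes the sigmoid activation and $C_i=\frac{1}{2}\|\mathbf{a}^{(L)}-\mathbf{y}^{(i)}\|^2$ (the factor $\frac{1}{2}$ is what the gradient formula above implies). All sizes and values are made up, and the finite-difference comparison is only a sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
a_prev = rng.random((5, 1))             # a^(L-1), made-up hidden activations
W = rng.standard_normal((3, 5))         # W^(L)
b = rng.standard_normal((3, 1))         # b^(L)
y = np.array([[0.0], [1.0], [0.0]])     # one-hot target

def cost(W, b):
    """C_i = 1/2 * ||a^(L) - y||^2 for the single sample above."""
    a = sigmoid(W @ a_prev + b)
    return 0.5 * np.sum((a - y) ** 2)

z = W @ a_prev + b
delta = (sigmoid(z) - y) * sigmoid_prime(z)   # delta^(L), Hadamard product
grad_W = delta @ a_prev.T                     # dC_i/dW^(L), shape (3, 5)
grad_b = delta                                # dC_i/db^(L), shape (3, 1)

# Central finite-difference check of one weight entry
eps = 1e-6
E = np.zeros_like(W)
E[1, 2] = eps
numeric = (cost(W + E, b) - cost(W - E, b)) / (2 * eps)
print(grad_W[1, 2], numeric)                  # the two values should agree closely
```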


### What about layer L − 1 (and further inside)?

In [None]:
display(Image(url='https://github.com/yexf308/MAT592/blob/main/image/ANN_L_1.png?raw=true', width=500))

By chain rule again, 

\begin{align}
\frac{\partial C_i}{\partial w_{kq}^{(L-1)}} &=\sum_j\frac{\partial C_i}{\partial a^{(L)}_j}\cdot \frac{\partial a^{(L)}_j}{\partial w_{kq}^{(L-1)}}\\ 
&= \sum_j\frac{\partial C_i}{\partial a^{(L)}_j}\cdot \frac{\partial a^{(L)}_j}{\partial a^{(L-1)}_k}\cdot \frac{\partial a^{(L-1)}_k}{\partial w_{kq}^{(L-1)}} \\ 
&= \sum_j \left(a_j^{(L)} - y_j^{(i)}\right) \cdot g'(z_j^{(L)})w_{jk}^{(L)} \cdot g'(z_k^{(L-1)})a_q^{(L-2)}
\end{align}
The middle term $\frac{\partial a^{(L)}_j}{\partial a^{(L-1)}_k}$ is the link between layers $L$ and $L-1$. 

---
Similarly in vector form, 
\begin{align}
\frac{\partial C_i}{\partial \mathbf{W}^{(L-1)}}= \left(\underbrace{(\mathbf{W}^{(L)})^\top \cdot\delta^{(L)}}_{\in \mathbb{R}^{h^{(L-1)}\times 1}} \circ \underbrace{g'(\mathbf{z}^{(L-1)})}_{\in \mathbb{R}^{h^{(L-1)}\times 1}}\right)\cdot \underbrace{(\mathbf{a}^{(L-2)})^\top}_{\in \mathbb{R}^{1\times h^{(L-2)}}} \in \mathbb{R}^{h^{(L-1)} \times h^{(L-2)}}
\end{align}

For convenience, define the auxiliary quantity $\delta^{(L-1)}$,

\begin{align}
\delta^{(L-1)} \triangleq \left((\mathbf{W}^{(L)})^\top \cdot \delta^{(L)}\right)\circ g'(\mathbf{z}^{(L-1)})
\end{align}

Then 
\begin{align}
\frac{\partial C_i}{\partial \mathbf{W}^{(L-1)}} = \delta^{(L-1)} \cdot (\mathbf{a}^{(L-2)})^\top
\end{align}





In [None]:
display(Image(url='https://github.com/yexf308/MAT592/blob/main/image/ANN_l.png?raw=true', width=700))

### Layer $l$
As we move further inside the network (from the output layer), we will need to compute more and more links between layers, 
\begin{align}
\frac{\partial C_i}{\partial w_{qr}^{(l)}}  = \sum_{j,k,\dots,p}\frac{\partial C_i}{\partial a^{(L)}_j}\cdot \frac{\partial a^{(L)}_j}{\partial a_{k}^{(L-1)}}\dots \frac{\partial a^{(l+1)}_p}{\partial a_{q}^{(l)}}\frac{\partial a^{(l)}_q}{\partial w_{qr}^{(l)}}
\end{align}


As before, we can define $\delta^{(l)}$ recursively,
\begin{align}
\delta^{(l)}= \left((\mathbf{W}^{(l+1)})^\top \cdot \delta^{(l+1)}\right)\circ g'(\mathbf{z}^{(l)})
\end{align}
so $\delta^{(l)}$ can be computed iteratively, starting from $\delta^{(L)}$ and moving backward through the layers. 

In vector form, 
\begin{align}
\frac{\partial C_i}{\partial \mathbf{W}^{(l)}}= \delta^{(l)}\cdot (\mathbf{a}^{(l-1)})^\top.
\end{align}

### To summarize the backpropagation algorithm

The products of the link terms may be computed iteratively from right to left in the network diagram, i.e., from the output layer back toward the input layer. This leads to an efficient algorithm for computing the derivatives $\frac{\partial C_i}{\partial w^{(l)}_{qr}}, \frac{\partial C_i}{\partial b^{(l)}_{q}}$ at every layer $l$:

- Forward propagation: Feedforward $\mathbf{x}^{(i)}$ to obtain all neuron inputs and outputs,

   \begin{align}
    &\mathbf{a}^{(0)}=\mathbf{x}^{(i)}, \\  
    &\mathbf{a}^{(l)}=g(\mathbf{W}^{(l)}\mathbf{a}^{(l-1)}+\mathbf{b}^{(l)})  \ \text{for } l =1, \dots, L
    \end{align}

- Backpropagate through the network to compute

\begin{align}
&\delta^{(L)}=\left(\mathbf{a}^{(L)}-\mathbf{y}^{(i)}\right)\circ g'(\mathbf{z}^{(L)}) \\ 
&\delta^{(l)}= \left((\mathbf{W}^{(l+1)})^\top \cdot \delta^{(l+1)}\right)\circ g'(\mathbf{z}^{(l)}), \ \text{for } l =L-1, \dots, 1
\end{align}

- Compute $\frac{\partial C_i}{\partial \mathbf{W}^{(l)}}, \frac{\partial C_i}{\partial \mathbf{b}^{(l)}}$ for every layer $l$ ,
\begin{align}
&\frac{\partial C_i}{\partial \mathbf{W}^{(l)}} = \delta^{(l)}\cdot (\mathbf{a}^{(l-1)})^\top \\ 
& \frac{\partial C_i}{\partial \mathbf{b}^{(l)}}  = \delta^{(l)}
\end{align}
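
The three steps above can be sketched as a single NumPy function. This is a minimal illustration assuming sigmoid activations in every layer and $C_i=\frac{1}{2}\|\mathbf{a}^{(L)}-\mathbf{y}^{(i)}\|^2$. The function name `backprop`, the layer sizes, and the random data are hypothetical. Note that Python lists are 0-indexed, so `weights[l]` holds $\mathbf{W}^{(l+1)}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Gradients of C_i = 1/2 * ||a^(L) - y||^2 for one sample (column vectors)."""
    # Forward pass: store every z^(l) and a^(l)
    a, zs, activations = x, [], [x]
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)

    # Backward pass: delta^(L) first, then the recursion down to the first layer
    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    for l in range(len(weights) - 1, -1, -1):
        grads_W[l] = delta @ activations[l].T   # delta times (input of this layer)^T
        grads_b[l] = delta
        if l > 0:
            # Propagate delta one layer back through the transposed weights
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])
    return grads_W, grads_b

rng = np.random.default_rng(2)
sizes = [4, 5, 3]                      # hypothetical layer sizes
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
x = rng.standard_normal((4, 1))
y = np.array([[1.0], [0.0], [0.0]])

def cost(weights, biases):
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return 0.5 * np.sum((a - y) ** 2)

grads_W, grads_b = backprop(x, y, weights, biases)

# Finite-difference check on one first-layer weight
eps = 1e-6
W_plus = [W.copy() for W in weights]
W_minus = [W.copy() for W in weights]
W_plus[0][2, 1] += eps
W_minus[0][2, 1] -= eps
numeric = (cost(W_plus, biases) - cost(W_minus, biases)) / (2 * eps)
print(grads_W[0][2, 1], numeric)       # the two values should agree closely
```

One forward pass and one backward pass give all the gradients at once, whereas finite differences would need two cost evaluations per parameter.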



## Training with gradient-based methods
### Stochastic gradient descent revisited

- Initialize all the weights $\mathbf{W}^{(l)}$ and biases $\mathbf{b}^{(l)}$, where $\theta = \left\{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\right\}_{l=1,\dots,L}$.

- Randomly shuffle the training dataset and, for each training sample $(\mathbf{x}^{(i)}, y^{(i)})$, 
   
    - Use backpropagation to compute the gradient $\frac{\partial C_i}{\partial \theta}$.

    - Update the weights and biases using learning rate $\eta>0$ 
    $$ \theta\leftarrow \theta - \eta \cdot \frac{\partial C_i}{\partial \theta}$$

  
  This completes one epoch in the training process.

- Repeat the preceding step until convergence.  
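
Here is a minimal sketch of this loop. To keep it short, it assumes a network with a single sigmoid layer ($L=1$), the square loss, toy data (the logical OR of two bits), and a learning rate $\eta=1$. None of these choices come from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (an assumption for illustration): the logical OR of two bits
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])   # one sample per column
Y = np.array([[0.0, 1.0, 1.0, 1.0]])

def total_loss(W, b):
    """L = 1/(2N) * sum_i ||a^(L)(x_i) - y_i||^2."""
    A = sigmoid(W @ X + b)
    return np.sum((A - Y) ** 2) / (2 * X.shape[1])

rng = np.random.default_rng(3)
W = np.zeros((1, 2))                   # simple initialization for this toy case
b = np.zeros((1, 1))
eta = 1.0                              # learning rate (assumed value)
loss_before = total_loss(W, b)

for epoch in range(200):
    for i in rng.permutation(X.shape[1]):      # reshuffle the samples every epoch
        x, y = X[:, [i]], Y[:, [i]]
        a = sigmoid(W @ x + b)
        delta = (a - y) * a * (1.0 - a)        # delta^(L), using sigma' = a (1 - a)
        W -= eta * delta @ x.T                 # theta <- theta - eta * dC_i/dtheta
        b -= eta * delta

loss_after = total_loss(W, b)
print(loss_before, loss_after)
```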

### Mini-batch 

- Initialize all the weights $\mathbf{W}^{(l)}$ and biases $\mathbf{b}^{(l)}$, where $\theta = \left\{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\right\}_{l=1,\dots,L}$.

- Randomly partition the training dataset $\mathcal{D}$ into $M$ batches $\mathcal{B}_1, \dots, \mathcal{B}_M$, each of size $B$ (the last batch may be smaller if $B$ does not divide $N$), so $M=\left \lceil{\frac{N}{B}}\right \rceil $. 

- For each iteration $s=1,\dots,M$, 
   - Compute the gradient of the loss function restricted to batch $\mathcal{B}_s$, 
   \begin{align}
   &\frac{\partial L(\theta; \mathcal{B}_s)}{\partial \theta} = \frac{1}{B}\sum_{i\in \mathcal{B}_s}\frac{\partial C_i}{\partial \theta}\approx  \frac{\partial L(\theta; \mathcal{D})}{\partial \theta} 
   \end{align}
   
   -  Update the weights and biases using learning rate $\eta>0$
  $$ \theta\leftarrow \theta - \eta \cdot \frac{\partial L(\theta; \mathcal{B}_s)}{\partial \theta}$$

  

  This completes one epoch in the training process.

- Repeat the preceding step until convergence.   

Note 

1. An epoch means training the neural network with all the training data for one cycle. In an epoch, we use all of the data exactly once.

2. Each epoch consists of several iterations. The number of iterations per epoch is the number of batches (steps through the randomly partitioned training data) needed to complete one epoch.

3. The choice of learning rate $\eta$ is crucial; we have talked about it in Lecture 7. 
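
The random partition into batches, and the resulting number of iterations per epoch, can be sketched as follows. The helper name `minibatches` and the values of $N$ and $B$ are made up for illustration.

```python
import numpy as np

def minibatches(N, B, rng):
    """Randomly partition the indices 0..N-1 into ceil(N/B) batches of size at most B."""
    perm = rng.permutation(N)
    return [perm[start:start + B] for start in range(0, N, B)]

rng = np.random.default_rng(4)
N, B = 10, 4                                 # made-up dataset size and batch size
batches = minibatches(N, B, rng)

print(len(batches))                          # iterations per epoch: ceil(10 / 4) = 3
print([len(batch) for batch in batches])     # batch sizes: [4, 4, 2]
```

Calling `minibatches` again at the start of each epoch gives a fresh random partition, so every sample is still used exactly once per epoch.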


   

In [None]:
Image(url='https://github.com/yexf308/MAT592/blob/main/image/SGD_learning.png?raw=true', width=1200)

SGD/Mini-batch has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum. 

### SGD with momentum
\begin{align}
&v^t =\gamma v^{t-1}+\eta \frac{\partial L(\theta; \mathcal{B}_s)}{\partial \theta}|_{\color{red}{\theta=\theta^{t-1}}} \\ 
& \theta^t = \theta^{t-1} -v^t
\end{align}

- The momentum coefficient $\gamma$ is typically set to 0.9. 

- Momentum accelerates standard SGD and typically converges faster; it is also simple to implement.

The name momentum comes from an analogy to physics, such as a ball accelerating down a slope. In the case of weight updates, we can think of the weights as a particle traveling through parameter space which incurs acceleration from the gradient of the loss.
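
To illustrate the update rule, here is a small sketch on a hypothetical "ravine", the quadratic $f(\theta)=\frac{1}{2}(\theta_1^2+50\,\theta_2^2)$. It uses the exact gradient instead of a mini-batch gradient, and the step size $\eta=0.03$ and the number of steps are arbitrary assumptions.

```python
import numpy as np

# Hypothetical "ravine": f(theta) = 0.5 * (theta_1^2 + 50 * theta_2^2)
curvatures = np.array([1.0, 50.0])

def grad(theta):
    """Exact gradient of the quadratic (stands in for the mini-batch gradient)."""
    return curvatures * theta

def run(gamma, eta=0.03, steps=200):
    """Iterate v^t = gamma v^{t-1} + eta * grad(theta^{t-1}), theta^t = theta^{t-1} - v^t."""
    theta = np.array([1.0, 1.0])
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + eta * grad(theta)
        theta = theta - v
    return theta

theta_plain = run(gamma=0.0)         # gamma = 0 recovers plain gradient descent
theta_momentum = run(gamma=0.9)
print(np.linalg.norm(theta_plain), np.linalg.norm(theta_momentum))
```

In this toy setting, plain gradient descent flips sign along the steep direction at every step and crawls along the flat one, while the momentum run ends closer to the minimizer at the origin.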

### SGD with Nesterov momentum
Nesterov momentum is a slightly different version of the momentum update that has been gaining popularity. In this version, we first look ahead to the point where the current momentum is pointing, $\theta^{t-1}-\gamma v^{t-1}$, and compute the gradient at that point.
\begin{align}
&v^t =\gamma v^{t-1}+\eta \frac{\partial L(\theta; \mathcal{B}_s)}{\partial \theta}|_{\color{red}{\theta =\theta^{t-1}-\gamma v^{t-1}}} \\ 
& \theta^t = \theta^{t-1} -v^t
\end{align}
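
A minimal sketch of the Nesterov update on a hypothetical quadratic, $f(\theta)=\frac{1}{2}(\theta_1^2+50\,\theta_2^2)$, again using the exact gradient in place of a mini-batch gradient. The step size $\eta=0.01$ and the helper name `nesterov_step` are assumptions made for this example.

```python
import numpy as np

# Hypothetical quadratic "ravine": f(theta) = 0.5 * (theta_1^2 + 50 * theta_2^2)
curvatures = np.array([1.0, 50.0])

def grad(theta):
    """Exact gradient of the quadratic (stands in for the mini-batch gradient)."""
    return curvatures * theta

def nesterov_step(theta, v, gamma=0.9, eta=0.01):
    """One Nesterov update: the gradient is evaluated at the look-ahead point."""
    lookahead = theta - gamma * v                 # theta^{t-1} - gamma v^{t-1}
    v_new = gamma * v + eta * grad(lookahead)     # v^t
    return theta - v_new, v_new                   # theta^t, v^t

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
for _ in range(200):
    theta, v = nesterov_step(theta, v)
print(theta)                                      # should be close to the minimizer (0, 0)
```

The only difference from the classical momentum update is the point at which `grad` is evaluated.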


In [None]:
Image(url='https://github.com/yexf308/MAT592/blob/main/image/momentum.png?raw=true', width=900)
# from https://ruder.io/. 

### More gradient-based methods 
See [this website](https://ruder.io/optimizing-gradient-descent/) for other methods, such as Adagrad and Adam. We will discuss these in more advanced classes.