<a href="https://colab.research.google.com/github/yexf308/MAT592/blob/main/23_Neural_Networks1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pylab inline 
import numpy.linalg as LA
from IPython.display import Image

Populating the interactive namespace from numpy and matplotlib


## Tip of the day - Debugging with PDB (optional, for the **advanced** users):

One of the main problems in working with Colab / Jupyter is the lack of a good debugger.

A not so optimal solution (but sometimes better than nothing) is to use Python's built-in debugger. Python comes with a very basic command line based debugger called [PDB](https://docs.python.org/3/library/pdb.html). It has all the basic debugger capabilities but is missing a good interface.

You can drop into debug mode in the middle of any Python code simply by placing the command **pdb.set_trace()** before the line which you want to debug (only after importing the **pdb** package). Once in debug mode, the debugger prompt will appear in which you can use the following commands:

- **l** or **list**: to print the current line and the surrounding code.
- **n** or **next**: to run the current line.
- **c** or **continue**: to continue running until a breakpoint or until the end of the function.
- **q** or **quit**: to stop the run and exit the debugger.
- **b \<number\>**: to place a breakpoint at line \<number\>.
- **!\<python expression\>**: to run any python expression.

(This is only a partial list of PDB's commands. For the full list, refer to the [official documentation](https://docs.python.org/2/library/pdb.html#debugger-commands) or to [this cheat sheet](https://kapeli.com/cheat_sheets/Python_Debugger.docset/Contents/Resources/Documents/index).)

Let us look at an example:

- Add the line **pdb.set_trace()** to the following code just before the **x \*= 4** line, and execute the cell to drop into debug mode.
- Type **l** (followed by **Enter**) to print the current line and surrounding code.
- Type **!print(x)** to print the value of the variable **x** (you can also omit the print command and just use **!x** in this case).
- Type **!x=2** to change the value of **x**.
- Type **n** to execute the next line.
- Type **l** again.
- Type **!print(x)** again.
- Type **b 10** to place a breakpoint on the **return x** line.
- Type **c** to run the code until the breakpoint.
- Type **l** again.
- Type **!print(x)** again.
- Type **c** to finish running the code, or **q** to quit the debugger without finishing.
- After you finish playing with the debugger, make sure you remove or comment out the **pdb.set_trace()** line.
- Clear out all the debugger output. You can do so, for example, by pressing the **X** button to the left of the cell's output area.

In [4]:
import pdb

def func(x):
    x += 2
    pdb.set_trace() 
    x *= 4
    x -= 2
    x /= 2
    
    return x

print(func(3))

> <ipython-input-4-ced2f7bc05d3>(6)func()
-> x *= 4
(Pdb) l
  1  	import pdb
  2  	
  3  	def func(x):
  4  	    x += 2
  5  	    pdb.set_trace()
  6  ->	    x *= 4
  7  	    x -= 2
  8  	    x /= 2
  9  	
 10  	    return x
 11  	
(Pdb) !print(X)
*** NameError: name 'X' is not defined
(Pdb) !print(x)
5
(Pdb) c
9.0


### !! Important

- The **pdb.set_trace()** command can be placed anywhere, but stepping through the code (using the **n** command) is only possible inside functions.
- **You must exit the debugger** (using **c** or **q**) in order to be able to run any other cells.

## Multi-class classification revisited
- **Given**: the training dataset $\mathcal{D}=\{\mathbf{x}^{(i)}, y^{(i)}\}_{i=1}^N$, the input feature $\mathbf{x}^{(i)}\in \mathbb{R}^D$ and the label $y^{(i)}\in \{1,2,\dots, K\}$, for $K$-class classification. 

- **Goal**:  learn a vector-valued function with tunable model parameters $\theta$:
$$\vec f(\mathbf{x}; \theta): \mathbb{R}^D \rightarrow (0,1)^K $$
  - $f_c(\mathbf{x}^{(i)}; \theta)$ is the **predicted probability** that input data $\mathbf{x}^{(i)}$ belongs to Class $c$, for each $c=1, \dots, K$. 

  - Tune $\theta$ with an optimization algorithm in the hope that, for all samples $1\le i \le N$, the class with the largest predicted probability is the true label:
   $$ \arg\max_{c} f_c(\mathbf{x}^{(i)}; \theta) = y^{(i)}$$
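The prediction rule above picks the class with the largest predicted probability. Here is a minimal sketch, assuming the model outputs are already available as an $N\times K$ array; the array `probs` and the labels `y` are made-up values for illustration only.

```python
import numpy as np

# Hypothetical predicted probabilities for N = 3 samples and K = 3 classes
# (each row plays the role of f(x^(i); theta) and sums to 1).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.3, 0.4, 0.3]])
y = np.array([1, 3, 1])  # true labels in {1, ..., K}

# np.argmax returns 0-based indices, so add 1 to match the 1-based labels.
y_pred = np.argmax(probs, axis=1) + 1

# Fraction of samples whose most probable class equals the true label.
accuracy = np.mean(y_pred == y)
print(y_pred, accuracy)
```

The third sample is misclassified because its largest probability sits on Class 2 while its label is Class 1.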

### Logistic regression
- $\vec f$ is a composition of a linear function and the softmax function. 

- The input data $\mathbf{x}$ is fed into a linear function $\mathbf{z}(\mathbf{x})=\mathbf{W}\mathbf{x} + \mathbf{b}\in \mathbb{R}^K$. Here $\mathbf{z}$ is the pre-activation vector, $\mathbf{W}\in \mathbb{R}^{K\times D}$ is the weight matrix and $\mathbf{b}$ is the bias vector. Note that the intercept is kept as a separate bias $\mathbf{b}$ here rather than being absorbed into the feature vector.

- The model outputs the predicted probabilities for the $K$ classes by applying the softmax activation function to $\mathbf{z}$. 

$$\vec f(\mathbf{x};\theta)= \text{softmax}(\mathbf{W}\mathbf{x}+\mathbf{b}) = \text{softmax}\left(\begin{bmatrix}\mathbf{w}_1 \mathbf{x}+b_1\\ \vdots  \\ \mathbf{w}_K \mathbf{x} +b_K\end{bmatrix}\right)=\begin{bmatrix} \frac{\exp(\mathbf{w}_1 \mathbf{x}+b_1)}{\sum_{j=1}^K \exp(\mathbf{w}_j \mathbf{x}+b_j)} \\ \vdots  \\ \frac{\exp(\mathbf{w}_K\mathbf{x}+b_K)}{\sum_{j=1}^K \exp(\mathbf{w}_j \mathbf{x}+b_j)} \end{bmatrix}\in(0,1)^K$$

   where weight matrix $\mathbf{W}\in \mathbb{R}^{K\times D}$ and $\mathbf{w}_j$ is the $j$-th row of $\mathbf{W}$. Here $\theta=(\mathbf{W},\mathbf{b})$. 

- **Goal**:  find parameters $\theta$ that minimize the training loss function.

   - The **cross entropy loss** for a given training sample $\mathbf{x}^{(i)}$ with $\vec f(\mathbf{x}^{(i)};\theta)\in (0,1)^K$, $1\le y^{(i)}\le K$:     
   $$\ell(\vec f(\mathbf{x}^{(i)};\theta), y^{(i)})=-\log f_{y^{(i)}}(\mathbf{x}^{(i)};\theta)\ge 0 $$
   This agrees with the general cross entropy, since $-\sum_{j}\mathbf{y}_j^{(i)} \log f_j(\mathbf{x}^{(i)};\theta)=-\log f_{y^{(i)}}(\mathbf{x}^{(i)};\theta)$ when $\mathbf{y}^{(i)}=\mathbf{e}_{y^{(i)}}$ is the one-hot vector of $y^{(i)}$.


   - For each sample, we want to make $f_{y^{(i)}}(\mathbf{x}^{(i)};\theta)$ (i.e., the predicted probability for Class $y^{(i)}$) as large (close to 1) as possible. 

   - The training loss to be minimized is the cross entropy loss on the full training set:
   $$ \min_{\theta} L(\theta; \mathcal{D})= -\frac{1}{N}\sum_{i=1}^N \log f_{y^{(i)}}(\mathbf{x}^{(i)};\theta)$$

   - A penalty term is usually added to reduce overfitting:
      $$ \min_{\theta} L(\theta; \mathcal{D})= -\frac{1}{N}\sum_{i=1}^N \log f_{y^{(i)}}(\mathbf{x}^{(i)};\theta) +\lambda ||\mathbf{W}||_2^2$$
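The model and loss above can be sketched in NumPy as follows. This is only an illustrative sketch: the function names (`softmax`, `predict_proba`, `cross_entropy_loss`), the random data, and the max-subtraction trick for numerical stability are assumptions added here and are not part of the notes.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def predict_proba(X, W, b):
    """X: (N, D), W: (K, D), b: (K,). Returns an (N, K) array of probabilities."""
    return softmax(X @ W.T + b)

def cross_entropy_loss(X, y, W, b, lam=0.0):
    """Mean cross entropy plus lam * ||W||^2; y holds labels in {1, ..., K}."""
    P = predict_proba(X, W, b)
    N = X.shape[0]
    log_p_true = np.log(P[np.arange(N), y - 1])  # log-probability of the true class
    return -log_p_true.mean() + lam * np.sum(W ** 2)

# Made-up data, only to exercise the functions.
rng = np.random.default_rng(0)
N, D, K = 6, 4, 3
X = rng.normal(size=(N, D))
y = rng.integers(1, K + 1, size=N)

# With all-zero parameters every class gets probability 1/K,
# so the unregularized loss equals log(K).
W0, b0 = np.zeros((K, D)), np.zeros(K)
P0 = predict_proba(X, W0, b0)
loss0 = cross_entropy_loss(X, y, W0, b0)
print(loss0, np.log(K))
```

Starting from zero parameters is a convenient sanity check: a loss far from $\log K$ at initialization usually points to a bug in the softmax or in the label indexing.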

In [None]:
display(Image(url='https://github.com/yexf308/MAT592/blob/main/image/MNIST_LR.png?raw=true', width=1000))
display(Image(url='https://github.com/yexf308/MAT592/blob/main/image/comparison_LR.png?raw=true', width=500))

# Artificial neural network: Introduction
## 1. Two-layer neural network
The following is the structure of a two-layer neural network. 


In [None]:
display(Image(url='https://github.com/yexf308/MAT592/blob/main/image/hidden.png?raw=true', width=400))

### (a) Structure of neural network

- The leftmost layer is the **input layer**, with features $x_1, x_2, \dots, x_D$ from the input vector $\mathbf{x}\in \mathbb{R}^D$. Here we have 5 neurons. 

- The grey layer is the **hidden layer**, whose neurons process inputs from the preceding layer and output results for the next layer. Here we have 4 hidden neurons.
$$\mathbf{a}(\mathbf{x})=g(\underbrace{\mathbf{W}^{(1)}\mathbf{x}+\mathbf{b}^{(1)}}_{= \mathbf{z}(\mathbf{x})})\in \mathbb{R}^h $$
  Here $\mathbf{W}^{(1)}\in \mathbb{R}^{h\times D}$ and $\mathbf{b}^{(1)}\in \mathbb{R}^h$ are the weight matrix and bias vector of the **(dense) linear layer**, and $g: \mathbb{R}\rightarrow \mathbb{R}$, $a_j = g(z_j)$, is an element-wise **(non-linear) activation function**. 


- The rightmost layer is the **softmax output layer**. Here we have 3 neurons. 
$$\vec f(\mathbf{x}; \theta) = \text{softmax}(\mathbf{W}^{(2)}\mathbf{a}(\mathbf{x})+\mathbf{b}^{(2)}) = \text{softmax}(\mathbf{W}^{(2)}g(\mathbf{W}^{(1)}\mathbf{x}+\mathbf{b}^{(1)})+\mathbf{b}^{(2)})\in(0,1)^K $$
  Here $\mathbf{W}^{(2)}\in \mathbb{R}^{K\times h}$ and $\mathbf{b}^{(2)}\in \mathbb{R}^K$ are the weight matrix and bias vector of the second **linear layer**.

- $\theta =\{\mathbf{W}^{(1)}, \mathbf{b}^{(1)},\mathbf{W}^{(2)}, \mathbf{b}^{(2)} \}$  are model parameters from linear layers. 

- The total cross-entropy loss is  
 $$ L(\theta; \mathcal{D})= -\frac{1}{N}\sum_{i=1}^N \log f_{y^{(i)}}(\mathbf{x}^{(i)};\theta)$$

- An **artificial neural network** is a composition of functions. Each neuron is a function: it accepts inputs from the previous layer and produces an output for the next layer. A neuron is activated (or fires) if its output value is high. 
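The forward pass of the two-layer network can be sketched as below, using the sizes from the figure ($D=5$, $h=4$, $K=3$). The choice of ReLU for $g$, the random weights, and the function name `two_layer_forward` are assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def two_layer_forward(x, W1, b1, W2, b2, g=relu):
    """Compute f(x; theta) = softmax(W2 g(W1 x + b1) + b2) for one input x."""
    z = W1 @ x + b1              # pre-activation of the hidden layer, shape (h,)
    a = g(z)                     # hidden activations, shape (h,)
    return softmax(W2 @ a + b2)  # class probabilities, shape (K,)

# Sizes from the figure; the weights are random placeholders, not trained values.
rng = np.random.default_rng(0)
D, h, K = 5, 4, 3
W1, b1 = rng.normal(size=(h, D)), np.zeros(h)
W2, b2 = rng.normal(size=(K, h)), np.zeros(K)

x = rng.normal(size=D)
f = two_layer_forward(x, W1, b1, W2, b2)
print(f, f.sum())
```

The output has one entry per class and sums to 1, so it can be plugged directly into the cross-entropy loss.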


 





---
One advantage of the softmax output layer is that its outputs can be interpreted as probabilities. 
However, one issue with using the softmax function in the output layer is that it is not element-wise: each output depends on all of the pre-activations.
That is why some prefer to use the sigmoid function with the logistic loss.   

The rightmost layer sometimes also uses the **sigmoid function** $\sigma$ and **logistic loss**, 
$$\vec f(\mathbf{x}; \theta) = \sigma(\mathbf{W}^{(2)}\mathbf{a}(\mathbf{x})+\mathbf{b}^{(2)}) = \sigma(\mathbf{W}^{(2)}g(\mathbf{W}^{(1)}\mathbf{x}+\mathbf{b}^{(1)})+\mathbf{b}^{(2)})$$

Then the logistic loss function is 
$$ L(\theta; \mathcal{D}) = -\frac{1}{N}\sum_{i=1}^N \left(\log f_{y^{(i)}}(\mathbf{x}^{(i)}; \theta) + \sum_{j\ne y^{(i)}}\log (1-f_{j}(\mathbf{x}^{(i)}; \theta))\right)$$
With the help of the one-hot vector $\mathbf{y}^{(i)}=\mathbf{e}_{y^{(i)}}$ (for example, Class 1 is represented by $[1, 0,\dots, 0]^\top$), this can be written as
$$ L(\theta; \mathcal{D})= -\frac{1}{N}\sum_{i=1}^N \sum_{j=1}^K \left(\mathbf{y}^{(i)}_j \log f_{j}(\mathbf{x}^{(i)}; \theta) +(1-\mathbf{y}^{(i)}_j)\log (1-f_{j}(\mathbf{x}^{(i)}; \theta))\right) $$

**Why not use the square loss here?**
In fact, we can: 
$$L(\theta; \mathcal{D}) = \frac{1}{2N}\sum_{i=1}^N \|\vec f(\mathbf{x}^{(i)};\theta)-\mathbf{y}^{(i)}\|^2 $$
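As a quick numerical check, the sketch below evaluates both forms of the logistic loss and the square loss on made-up sigmoid outputs. The array `Z` stands in for the output pre-activations $\mathbf{W}^{(2)}\mathbf{a}(\mathbf{x})+\mathbf{b}^{(2)}$; it and the labels are random placeholders, not results from a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, K = 5, 3
Z = rng.normal(size=(N, K))         # placeholder pre-activations, one row per sample
F = sigmoid(Z)                      # element-wise sigmoid outputs in (0, 1)
y = rng.integers(1, K + 1, size=N)  # labels in {1, ..., K}
Y = np.eye(K)[y - 1]                # one-hot rows e_{y^(i)}

log_f, log_1mf = np.log(F), np.log(1.0 - F)
rows = np.arange(N)

# Form 1: log f for the true class plus log(1 - f) for every other class.
loss_a = -np.mean(log_f[rows, y - 1]
                  + log_1mf.sum(axis=1) - log_1mf[rows, y - 1])

# Form 2: the same quantity written with one-hot vectors.
loss_b = -np.mean(np.sum(Y * log_f + (1 - Y) * log_1mf, axis=1))

# Square loss: (1 / 2N) * sum_i ||f(x^(i)) - y^(i)||^2.
square_loss = 0.5 * np.mean(np.sum((F - Y) ** 2, axis=1))
print(loss_a, loss_b, square_loss)
```

The two logistic-loss values agree up to floating-point error, which confirms that the one-hot expression is just a rewriting of the first formula.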




---



### (b) Dense linear layer
A dense layer defines a linear mapping, specified by the weights $\mathbf{W}$ and the bias $\mathbf{b}$.

- Both the input and the output of the layer are **vectors**; if the input is not a vector (such as an image), flatten it first. 

- Each output $z_j$ is a weighted sum of the inputs $\mathbf{x}$ plus a bias term. 
   $$ z_j = \mathbf{w}_j \mathbf{x}+b_j$$
   where $ \mathbf{w}_j$ is the $j$-th row of $\mathbf{W}$.

- Every input component is connected to every output component, which justifies the name _dense_; such a layer is also known as a fully-connected layer. 
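A dense layer is a single matrix-vector product plus a bias. The sketch below flattens an image into a vector and applies such a layer. The $28\times 28$ image size, the output size of 10, and the random values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 28x28 grayscale image; a dense layer needs a vector input.
image = rng.random((28, 28))
x = image.reshape(-1)             # flatten to a vector with D = 784 entries

h = 10                            # number of outputs of the layer (arbitrary)
W = rng.normal(size=(h, x.size))  # weight matrix, shape (h, D)
b = rng.normal(size=h)            # bias vector, shape (h,)

# Vectorized form: z = W x + b.
z = W @ x + b

# Component form: z_j = w_j x + b_j, with w_j the j-th row of W.
z_loop = np.array([W[j] @ x + b[j] for j in range(h)])

n_params = W.size + b.size        # every input connects to every output
print(z.shape, np.allclose(z, z_loop), n_params)
```

Because the layer is fully connected, the parameter count is $h\cdot D + h$, which grows quickly with the input size.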

### (c) Connection to biological neuron


In [None]:
display(Image(url='https://github.com/yexf308/MAT592/blob/main/image/neuron.png?raw=true', width=800))

- Neurons (or nerve cells) are special cells that process and transmit information by electrical signaling (in brain and also spinal cord). 

- A neuron connects to other neurons to form a network. Each neuron communicates with between 1,000 and 10,000 other neurons. When some group of neurons fires, it can cause other neurons to fire.

- A neuron has three components: 
   - **dendrites**: “input wires”, receive inputs from other neurons. 

   - **cell body**: computational unit.

   - **axon**: “output wire”, sends signal to other neurons.    

### (d) Element-wise activation function

Element-wise activation: for any $\mathbf{z}\in \mathbb{R}^h$, 
  $$ g(\mathbf{z})_j = g(z_j),\qquad j=1,\dots, h$$



In [None]:
display(Image(url='https://github.com/yexf308/MAT592/blob/main/image/activation.png?raw=true', width=900))

- $\text{sigm}(x)$ is the sigmoid function. 

- $\tanh(x)$ is the hyperbolic tangent function. 

- $\text{relu}(x)$ is the rectified linear unit function.

**Blue**: activation function. **Green**: derivative of activation function.
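The three activation functions and their derivatives can be written in a few lines of NumPy. This is a sketch with function names chosen here for illustration; the derivative of ReLU at $0$ is set to $0$ by convention, since it is not differentiable there.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigm(x):
    s = sigm(x)
    return s * (1.0 - s)          # sigm'(x) = sigm(x) (1 - sigm(x))

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2  # tanh'(x) = 1 - tanh(x)^2

def relu(x):
    return np.maximum(x, 0.0)

def d_relu(x):
    return (x > 0).astype(float)  # 1 for x > 0, 0 otherwise (0 at x = 0 by convention)

x = np.linspace(-3.0, 3.0, 7)     # the points -3, -2, ..., 3

# Central finite differences as a check on the analytic derivatives.
eps = 1e-6
fd_sigm = (sigm(x + eps) - sigm(x - eps)) / (2 * eps)
fd_tanh = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
print(np.allclose(fd_sigm, d_sigm(x), atol=1e-6),
      np.allclose(fd_tanh, d_tanh(x), atol=1e-6))
```

Note that the derivatives of the sigmoid and tanh functions become very small for large $|x|$, while the derivative of ReLU stays at 1 for all positive inputs.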

## 2. Deep Neural Network (DNN)
 A neural network is called a **deep neural network** if it has more than two layers (otherwise, it is said to be shallow). It has multiple hidden layers, each consisting of a linear layer followed by an element-wise non-linear activation. The parameters are the collection of weights and biases from the linear layers. As with the two-layer NN, we can use the cross entropy loss with a softmax output layer. 
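A deep network's forward pass is a loop over layers. The following is a minimal sketch, assuming ReLU hidden activations and a softmax output; the layer sizes, the scaled random initialization, and the names `init_params` and `dnn_forward` are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def init_params(sizes, rng):
    """One (W, b) pair per linear layer; sizes = [D, h_1, ..., h_m, K]."""
    return [(rng.normal(size=(n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def dnn_forward(x, params, g=relu):
    """Hidden layers apply g(W a + b); the last layer applies softmax."""
    a = x
    for W, b in params[:-1]:
        a = g(W @ a + b)
    W, b = params[-1]
    return softmax(W @ a + b)

rng = np.random.default_rng(0)
sizes = [5, 8, 8, 3]  # D = 5 inputs, two hidden layers of width 8, K = 3 classes
params = init_params(sizes, rng)

f = dnn_forward(rng.normal(size=sizes[0]), params)
n_params = sum(W.size + b.size for W, b in params)
print(f, n_params)
```

Adding another hidden layer only means adding one more entry to `sizes`; the forward pass itself does not change.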

In [None]:
display(Image(url='https://github.com/yexf308/MAT592/blob/main/image/DNN.png?raw=true', width=900))

### Why can DNNs work?

In classification tasks, ML aims to learn an underlying decision function $\hat{f}$ that maps any data point to its correct label. DNN models can approximate such a decision function $\hat{f}$ to arbitrary precision, given sufficient depth. This is guaranteed by the following **universal approximation theorem**. 



---


**Theorem**: For any Lebesgue-integrable function $\hat f: \mathbb{R}^D\rightarrow \mathbb{R}$ and any $\epsilon>0$, there exists a
deep dense (fully-connected) ReLU network $\mathcal{A}$ with width $\le D+4$ such that the function $F_{\mathcal{A}}$ represented by the network satisfies 
$$ \int_{\mathbb{R}^D} |\hat{f}(\mathbf{x})- F_{\mathcal{A}}(\mathbf{x})|d\mathbf{x}<\epsilon $$



---
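To get a feel for how ReLU units can approximate a function, the sketch below builds a piecewise-linear interpolant of a 1D target from ReLUs and measures its $L^1$ error. This is only an illustration of expressiveness under simplifying assumptions: it uses a single wide hidden layer on $[0,1]$ rather than the narrow, deep construction in the theorem, and the target $\sin(2\pi x)$ and the helper `relu_interpolant` are choices made here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_interpolant(f, n):
    """One-hidden-layer ReLU network that matches f at n + 1 equispaced knots on [0, 1]."""
    t = np.linspace(0.0, 1.0, n + 1)        # knots t_0, ..., t_n
    slopes = np.diff(f(t)) / np.diff(t)     # slope on each of the n intervals
    coeffs = np.diff(slopes, prepend=0.0)   # change of slope at knots t_0, ..., t_{n-1}

    def F(x):
        x = np.asarray(x, dtype=float)
        # F(x) = f(t_0) + sum_k coeffs[k] * relu(x - t_k)
        return f(t[0]) + relu(x[..., None] - t[:-1]) @ coeffs

    return F

def target(x):
    return np.sin(2 * np.pi * x)

xs = np.linspace(0.0, 1.0, 2001)            # fine grid to estimate the integral on [0, 1]

F_coarse = relu_interpolant(target, 4)      # 4 hidden ReLU units
F_fine = relu_interpolant(target, 32)       # 32 hidden ReLU units

err_coarse = np.mean(np.abs(target(xs) - F_coarse(xs)))
err_fine = np.mean(np.abs(target(xs) - F_fine(xs)))
print(err_coarse, err_fine)
```

The estimated $L^1$ error drops as more ReLU units are used, which is the qualitative behavior the theorem describes, although the theorem obtains it by increasing depth at bounded width.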


On the computational side, learning (near-)optimal DNNs can be difficult because the optimization problem is highly non-convex.