October 21
#ANN Multilayer Perceptron

Recall that an artificial neuron is a simplified model of a biological neuron and is the basic unit of an artificial neural network.

It has a couple of limitations:
1. It can only represent a limited set of functions.
2. It can only distinguish (by the value of its output) sets of inputs that are linearly separable.

Even if we use a non-linear activation function, a single artificial neuron can only perform binary classification.

For example, consider $f(x) = \frac{1}{1 + e^{-x}}$, where $x = \sum\omega_{i} x_{i}$, and assign every input with $f(x) > 0.5$ to one category. Since $f(x) > 0.5$ exactly when $x > 0$, and $x$ is linear in the inputs, the decision boundary $x = 0$ is also linear.
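The claim that thresholding the sigmoid at 0.5 gives a linear boundary can be checked numerically. The following is a minimal sketch; the weights and sample points are hypothetical values chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_class(weights, inputs):
    """Return 1 if the sigmoid output exceeds 0.5, else 0."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return 1 if sigmoid(z) > 0.5 else 0

# Hypothetical weights and points, chosen only for illustration.
weights = [2.0, -1.0]
points = [(1.0, 1.0), (1.0, 3.0), (-0.5, -2.0), (0.0, 0.5)]

for p in points:
    z = sum(w * x for w, x in zip(weights, p))
    # Thresholding the sigmoid at 0.5 agrees with the linear test z > 0.
    assert neuron_class(weights, p) == (1 if z > 0 else 0)
```

So the neuron splits the input space along the line $\sum\omega_{i} x_{i} = 0$, whatever the weights are.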



#Multi-Layer Perceptron Neural Network

In order to solve this problem, we introduce the multi-layer perceptron (MLP) neural network.

An MLP is a type of feed-forward neural network.

It consists of three types of layers:
 1. Input layer (also called layer i)
 2. Hidden layer (also called layer j)
 3. Output layer (also called layer k)

\begin{matrix}
  &input-layer   &hidden-layer    &output-layer \\
  &x_{1}         &\sum \quad f    &\sum \quad g  \\
  &x_{2}         &\sum \quad f    &\sum \quad g  \\
  &x_{3}         &\sum \quad f    &\sum \quad g
\end{matrix}

**Note:**
Here, $f$ and $g$ can be different. In practice, we often choose different $f$ and $g$, because they give different gradients, which may lead to a better fit.

##How to initialize the weights and biases?
Answer: Initialize them to some small random values.
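A minimal sketch of such an initialization is shown below. The notes only say "small random values", so the uniform range of ±0.1, the helper name `init_layer`, and the fixed seed are all illustrative assumptions.

```python
import random

def init_layer(n_in, n_out, scale=0.1, seed=None):
    """Return (weights, biases) drawn uniformly from [-scale, scale].

    weights[j][i] connects input i to node j of the layer.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [rng.uniform(-scale, scale) for _ in range(n_out)]
    return weights, biases

# Example: a hidden layer with 2 inputs and 2 nodes.
w_ij, theta_j = init_layer(2, 2, seed=0)
```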

## How to perform training?

Answer:
 1. Let the network calculate the output with the given inputs (forward propagation)
 2. Calculate the error/loss function (i.e. the difference between the calculated outputs and the target outputs)
 3. Update the weights and biases between the hidden and output layer (backward propagation)
 4. Update the weights and biases between the input and hidden layer (backward propagation)
 5. Go back to step 1

 ## When to stop training?

 Answer:
 1. After a fixed number of iterations through the loop.
 2. Once the training error falls below some threshold.
 3. Stop at a minimum of the error on the validation set.
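The three stopping criteria can be combined into one check, sketched below. The function name `should_stop`, the default thresholds, and the `patience` rule used to detect a validation minimum are assumptions made for illustration, not part of the original notes.

```python
def should_stop(epoch, train_error, val_errors, max_epochs=10000,
                error_threshold=1e-3, patience=5):
    """Return the reason to stop training, or None to keep going."""
    # 1. A fixed number of iterations has been reached.
    if epoch >= max_epochs:
        return "max_epochs"
    # 2. The training error fell below the threshold.
    if train_error < error_threshold:
        return "train_error_below_threshold"
    # 3. The validation error has not improved during the last
    #    `patience` epochs, so we are past its minimum.
    if len(val_errors) > patience:
        best_before = min(val_errors[:-patience])
        if min(val_errors[-patience:]) >= best_before:
            return "validation_minimum"
    return None
```

In a training loop this would be called once per epoch, with `val_errors` holding the validation error recorded after each epoch so far.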

##How to update the weights and biases?
Assume the activation function is the sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$, and the error function is $E = \frac{1}{2} \sum (O_{i}-T_{i})^{2}$.

The derivative of the sigmoid function satisfies: $\frac{d}{dx}\sigma(x) = \sigma(x)(1-\sigma(x))$
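The derivative identity can be checked against a central finite difference. This is a small sanity check rather than a proof; the step size `h` and the test points are arbitrary choices.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative via the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 1.5):
    assert abs(sigmoid_prime(x) - numeric_derivative(sigmoid, x)) < 1e-8
```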

Consider:
\begin{align}
O_{k} &= \sigma(\theta_{k} + \sum_{j} \omega_{jk}O_{j}) = \sigma(x), \\
\frac{\partial O_{k}}{\partial \theta_{k}} &= \frac{\partial O_{k}}{\partial (\theta_{k} + \sum_{j} \omega_{jk}O_{j})} \cdot \frac{\partial (\theta_{k} + \sum_{j} \omega_{jk}O_{j})}{\partial \theta_{k}} \\
&= \frac{\partial \sigma(x)}{\partial x} \cdot 1 \\
&= \sigma(x)(1-\sigma(x)) \\
&= O_{k}(1-O_{k})
\end{align}

So, we can obtain the gradient with respect to $\theta_{k}$:
\begin{align}
\frac{\partial E}{\partial \theta_{k}} &= \frac{\partial (\frac{1}{2} \sum (O_{i}-T_{i})^{2})}{\partial O_{k}} \cdot \frac{\partial O_{k}}{\partial \theta_{k}} \\
&= (O_{k} - T_{k}) \cdot \frac{\partial O_{k}}{\partial \theta_{k}} \\
&= (O_{k} - T_{k})O_{k}(1-O_{k})
\end{align}

In the same way, the gradient with respect to $\theta_{j}$ is:
\begin{align}
O_{j} &= \sigma(\theta_{j} + \sum_{i} \omega_{ij}O_{i}) \\
\frac{\partial E}{\partial \theta_{j}}
&= \frac{\partial E}{\partial O_{j}} \cdot \frac{\partial O_{j}}{\partial \theta_{j}} \\
&= (\sum_{k} \frac{\partial E}{\partial (\theta_{k} + \sum_{j'} \omega_{j'k}O_{j'})} \cdot \frac{\partial (\theta_{k} + \sum_{j'} \omega_{j'k}O_{j'})}{\partial O_{j}}) \cdot \frac{\partial O_{j}}{\partial \theta_{j}} \\
&= (\sum_{k} \delta_{k}\cdot\omega_{jk}) \cdot \frac{\partial O_{j}}{\partial \theta_{j}} \\
&= O_{j}(1-O_{j})\sum_{k} \delta_{k}\cdot\omega_{jk}
\end{align}

where $\delta_{k} = (O_{k} - T_{k})O_{k}(1-O_{k})$, which is the gradient with respect to $\theta_{k}$.

**Note:**
The second equality holds because $O_{j}$ influences every node in the output layer. So, when we take the gradient, we need to consider every net input $\theta_{k} + \sum_{j'} \omega_{j'k}O_{j'}$ and sum the contributions.

Similarly, we can obtain the gradients with respect to the weights $\omega_{j}$ and $\omega_{k}$ (written per connection as $\omega_{ij}$ and $\omega_{jk}$):

\begin{align}
\frac{\partial O_{k}}{\partial \omega_{jk}} &= \frac{\partial O_{k}}{\partial (\theta_{k} + \sum_{j'} \omega_{j'k}O_{j'})} \cdot \frac{\partial (\theta_{k} + \sum_{j'} \omega_{j'k}O_{j'})}{\partial \omega_{jk}} \\
&= \frac{\partial \sigma(x)}{\partial x} \cdot O_{j} \\
&= \sigma(x)(1-\sigma(x))O_{j} \\
&= O_{k}(1-O_{k})O_{j}
\end{align}

**And:**
\begin{align}
\frac{\partial E}{\partial \omega_{ij}}
&= \frac{\partial E}{\partial O_{j}} \cdot \frac{\partial O_{j}}{\partial \omega_{ij}} \\
&= (\sum_{k} \frac{\partial E}{\partial (\theta_{k} + \sum_{j'} \omega_{j'k}O_{j'})} \cdot \frac{\partial (\theta_{k} + \sum_{j'} \omega_{j'k}O_{j'})}{\partial O_{j}}) \cdot \frac{\partial O_{j}}{\partial \omega_{ij}} \\
&= (\sum_{k} \delta_{k}\cdot\omega_{jk}) \cdot \frac{\partial O_{j}}{\partial \omega_{ij}} \\
&= O_{j}(1-O_{j})O_{i}\sum_{k} \delta_{k}\cdot\omega_{jk}
\end{align}

**where:**

 $O_{j}$ is the output of layer $j$ , $O_{i}$ is the output of layer $i$ , and $O_{k}$ is the output of layer $k$.

$\theta_{j}$ is the bias of layer $j$ , and $\theta_{k}$ is the bias of layer $k$.

$\omega_{j}$ is the weight of layer $j$ , and $\omega_{k}$ is the weight of layer $k$. Specifically, $\omega_{ij}$ (also written $w_{ij}$) is the weight connecting a node in layer $i$ to a node in layer $j$ ; $\omega_{jk}$ (also written $w_{jk}$) is the weight connecting a node in layer $j$ to a node in layer $k$.

**From the above discussion, we obtain the following formulas for updating the weights and biases:**

**Denote:**

\begin{align}
\delta_{k} &= (O_{k} - T_{k})O_{k}(1-O_{k}) \\
\delta_{j} &= O_{j}(1-O_{j})\sum_{k \in K} \delta_{k}\cdot\omega_{jk}
\end{align}

**Updating rule:**
\begin{align}
w_{jk} &= w_{jk} - \eta\delta_{k}O_{j} \\
w_{ij} &= w_{ij} - \eta\delta_{j}O_{i} \\
\theta_{k} &= \theta_{k} - \eta\delta_{k} \\
\theta_{j} &= \theta_{j} - \eta\delta_{j}
\end{align}

where $\eta$ is the learning rate.
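Before relying on these formulas, it can help to compare them with finite differences on a tiny network. The sketch below assumes a 2-2-1 sigmoid network with a single output, $\delta_{k} = (O_{k} - T_{k})O_{k}(1-O_{k})$, and made-up values for the input, target, weights, and biases; the helper names are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_ij, theta_j, w_jk, theta_k):
    """Forward pass of a 2-2-1 sigmoid network; returns (O_j, O_k)."""
    o_j = [sigmoid(theta_j[j] + sum(w_ij[i][j] * x[i] for i in range(2)))
           for j in range(2)]
    o_k = sigmoid(theta_k + sum(w_jk[j] * o_j[j] for j in range(2)))
    return o_j, o_k

def error(x, t, w_ij, theta_j, w_jk, theta_k):
    """E = 1/2 (O_k - T_k)^2 for the single output node."""
    return 0.5 * (forward(x, w_ij, theta_j, w_jk, theta_k)[1] - t) ** 2

# Illustrative values (assumptions, not taken from a trained model).
x, t = [1.0, 0.0], 1.0
w_ij = [[-0.65, 1.11], [0.64, 0.84]]   # w_ij[i][j]: input i -> hidden j
theta_j = [0.1, -0.2]
w_jk = [0.86, -1.38]                   # w_jk[j]: hidden j -> output
theta_k = 0.05

o_j, o_k = forward(x, w_ij, theta_j, w_jk, theta_k)
delta_k = (o_k - t) * o_k * (1.0 - o_k)
delta_j = [o_j[j] * (1.0 - o_j[j]) * delta_k * w_jk[j] for j in range(2)]

# Central finite differences for dE/dtheta_k and dE/dw_ij with i = j = 0.
h = 1e-6
num_theta_k = (error(x, t, w_ij, theta_j, w_jk, theta_k + h)
               - error(x, t, w_ij, theta_j, w_jk, theta_k - h)) / (2 * h)

def perturbed(i, j, d):
    """Copy of w_ij with entry (i, j) shifted by d."""
    w = [row[:] for row in w_ij]
    w[i][j] += d
    return w

num_w_00 = (error(x, t, perturbed(0, 0, h), theta_j, w_jk, theta_k)
            - error(x, t, perturbed(0, 0, -h), theta_j, w_jk, theta_k)) / (2 * h)

assert abs(num_theta_k - delta_k) < 1e-8          # dE/dtheta_k = delta_k
assert abs(num_w_00 - delta_j[0] * x[0]) < 1e-8   # dE/dw_ij = delta_j * O_i

# One update step with learning rate eta, following the updating rule.
e_before = error(x, t, w_ij, theta_j, w_jk, theta_k)
eta = 0.5
w_ij = [[w_ij[i][j] - eta * delta_j[j] * x[i] for j in range(2)]
        for i in range(2)]
theta_j = [theta_j[j] - eta * delta_j[j] for j in range(2)]
w_jk = [w_jk[j] - eta * delta_k * o_j[j] for j in range(2)]
theta_k = theta_k - eta * delta_k
e_after = error(x, t, w_ij, theta_j, w_jk, theta_k)
```

All four parameter groups are updated from deltas computed before any weight changes, which matches the order of the derivation.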



In [None]:
import numpy as np # Import NumPy

class MultiLayerPerceptron:
  def __init__(self):
    """ Multi-layer perceptron initialization """
    self.wij = np.array([    # Weights between input and hidden layer
      [-0.65, 0.64],         # w1, w2
      [1.11, 0.84]           # w3, w4
    ])
    self.wjk = np.array([      # Weights between hidden and output layer
      [0.86],                  # w5
      [-1.38]                  # w6
    ])
    self.tj = np.array([        # Biases of nodes in the hidden layer
      [0.0],                    # theta 1
      [0.0]                     # theta 2
    ])
    self.tk = np.array([[0.0]]) # Bias in the output layer, Theta 3
    self.learning_rate = 0.5    # Eta
    self.max_round = 10000      # Number of rounds

  def sigmoid(self, z, sig_cal=False):
    """ Sigmoid function and the calculation of z * (1-z) """
    # If sig_cal is True, return sigmoid
    if sig_cal: return 1 / (1 + np.exp(-z))

    # If sig_cal is False, return z * (1-z)
    return z * (1-z)

  def forward(self, x, predict=False):
    """ Forward propagation """
    # Get the training example as a column vector. Shape (2,1)
    sample = x.reshape(len(x), 1)

    # Compute the hidden node outputs. Shape (2,1)
    # w1  w2        x1        w1*x1 + w2*x2
    # w3  w4    @   x2   =    w3*x1 + w4*x2
    yj = self.sigmoid(self.wij.dot(sample) + self.tj, sig_cal=True)

    # Compute the output of node in the output layer. Shape (1,1)
    yk = self.sigmoid(self.wjk.transpose().dot(yj) + self.tk, sig_cal=True)

    # If predict is True, return only the output of the node in the output layer
    if predict: return yk

    # Return (data sample, hidden node outputs, predicted output)
    return (sample, yj, yk)

  def backpropagation(self, values, Tk):
    # values: return by forward
    # Tk: target

    Oi = values[0] # Input sample
    Oj = values[1] # Hidden node outputs
    Ok = values[2] # Predicted output

    """ back propagation """
    # deltak = (Ok-Tk)Ok(1-Ok). Shape (1,1)
    # here, sigmoid would return the derivative by default
    deltaK = np.multiply((Ok- Tk), self.sigmoid(Ok))

    # deltaj = Oj(1-Oj)(deltak)(Wjk). Shape (2,1)
    #
    deltaJ = np.multiply(self.sigmoid(Oj), deltaK[0][0] * self.wjk)

    """ start update weight and bias """
    # wjk = wjk- eta(deltak)(Oj). Shape (2,1)
    self.wjk-= self.learning_rate * deltaK[0][0] * Oj

    # thetak = thetak- eta(deltak). Shape (1,1)
    self.tk-= self.learning_rate * deltaK

    # wij = wij- eta(deltaj)(Oi). Shape (2,2)
    # delta1                    delta1*x1     delta1*x2
    # delta2    @   x1  x2  =   delta2*x1     delta2*x2
    s = self.learning_rate * deltaJ.dot(Oi.T)

    # w1  w2        delta1*x1     delta1*x2
    # w3  w4    -   delta2*x1     delta2*x2
    self.wij-= s

    # thetaj = thetaj- eta(deltaj). Shape (2,1)
    self.tj-= self.learning_rate * deltaJ

  def train(self, X, T):
    """ Training """
    for i in range(self.max_round): # Train for max_round rounds
      for j in range(len(X)):       # Use all the samples (no reliance on a global m)
        # print(f'Iteration: {i+1} and {j+1}')
        values = self.forward(X[j])        # Forward propagation
        self.backpropagation(values, T[j]) # Back propagation

  def print(self):
    print(f'wij: {self.wij}')
    print(f'wjk: {self.wjk}')
    print(f'tj: {self.tj}')
    print(f'tk: {self.tk}')


In [None]:
if __name__ == '__main__':
  m = 4
  X = np.array([ # Input data
      [0, 0],
      [0, 1],
      [1, 0],
      [1, 1]
  ])
  T = np.array([ # Target values
      [0],
      [1],
      [1],
      [0]
  ])
  mlp = MultiLayerPerceptron() # Create an object
  mlp.train(X, T)
  mlp.print()
  for k in range(m):
      Ok = mlp.forward(X[k],True)   # Forward propagation (prediction only)
      print(f'y{k}: {Ok}')          # Print the predicted output

wij: [[-6.25165367  6.22962318]
 [-5.7179006   5.96706892]]
wjk: [[ 9.53377675]
 [-9.2900534 ]]
tj: [[-3.40695349]
 [ 2.8512169 ]]
tk: [[4.41777385]]
y0: [[0.01697257]]
y1: [[0.98413923]]
y2: [[0.98051328]]
y3: [[0.0151785]]


#Handwritten Digits Recognition using MLP

 We will build an MLP artificial neural network to recognize/classify handwritten digits.

 **Terminology:**
 1. **Training data:**
 The data our model learns from. It is sometimes split into training and validation data.
 2. **Testing data:**
 Data that is kept hidden from the model until after it has been trained. Testing data is used to evaluate our model.
 3. **Loss function:**
  A function that is *differentiated to update the parameters* during training, monitored each epoch to *decide on convergence* during training, and used to *quantify how accurate a model's predictions are*.
  **The only objective of the neural network is optimizing/minimizing the loss function.**
 4. **Optimization algorithm:**
 It controls exactly **how the parameters are adjusted during training** (e.g., gradient descent).


##Dataset
We use the Modified National Institute of Standards and Technology
(MNIST) dataset.

This dataset contains two sets of samples:
 1. Training data: 60,000 images of handwritten digits from 0 to 9, each 28 pixels × 28 pixels.
 2. Testing data: 10,000 images, each 28 pixels × 28 pixels.


#Procedures

 1. Import the required libraries and define a global variable
 2. Load the data
 3. Explore the data
 4. Build the model
 5. Compile the model
 6. Train the model
 7. Evaluate the model accuracy
 8. Save the model
 9. Use the model
 10. Plot the confusion matrix

##Build the Model

**How do we get the number of parameters between two layers?**
\begin{align}
parameter = (prev + 1) \times current
\end{align}
where $parameter$ is the number of trainable parameters, $prev$ is the number of nodes in the previous layer, and $current$ is the number of nodes in the current layer. The $+1$ accounts for the bias of each node in the current layer.
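As a worked example of this formula, the sketch below counts the parameters of a fully connected network with a 28 × 28 input, two hidden layers of 128 nodes, and 10 outputs. The function name `dense_params` is illustrative, and the layer sizes are assumed here to match the digit-recognition model described in these notes.

```python
def dense_params(prev, current):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return (prev + 1) * current

# Assumed layer sizes: flattened 28x28 input, 128, 128, 10.
sizes = [28 * 28, 128, 128, 10]
per_layer = [dense_params(p, c) for p, c in zip(sizes, sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [100480, 16512, 1290]
print(total)      # 118282
```

The flatten step itself has no trainable parameters, so it does not appear in the count.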

**Remark:**

Usually, we multiply the update in the hidden layers by a small factor when updating parameters (e.g., $\omega_{ij} = \omega_{ij} + (\eta \alpha \nabla E) \cdot x_{ij}$, where $\alpha$ is the small number ($\alpha = 1$), $\eta$ is the learning rate, and $\nabla E$ is the gradient).

This is because the **gradient can increase a lot during back-propagation**. So, we need to multiply by a small number ($\alpha$) to **avoid overfitting**.

##Optimization

**Categorical/Multi-class cross-entropy loss:**
\begin{align}
CCE = - \sum_{i \in C} y_{i} \cdot \log(p_{i})
\end{align}
where $y$ represents the actual labels, usually a one-hot vector, and $p$ represents the predictions, usually the output of Softmax.

**Usage:**
1. Mostly used to measure the **distance between two distributions**.
2. Measures **how close the predictions are to the actual labels**.
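A minimal sketch of this loss for one sample, assuming a one-hot label and predictions that already sum to 1. The clipping constant `eps` is an added assumption that guards against `log(0)`; the example probabilities are made up.

```python
import math

def categorical_cross_entropy(y_true, p_pred, eps=1e-12):
    """CCE = -sum_i y_i * log(p_i), with p_i clipped away from zero."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, p_pred))

y = [0, 0, 1]                      # one-hot label: the true class is index 2
confident = [0.05, 0.05, 0.90]     # high probability on the true class
wrong = [0.70, 0.20, 0.10]         # low probability on the true class

loss_confident = categorical_cross_entropy(y, confident)  # -log(0.9), about 0.105
loss_wrong = categorical_cross_entropy(y, wrong)          # -log(0.1), about 2.303
```

With a one-hot label only the true class contributes, so the loss reduces to the negative log of the probability assigned to that class.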

##Save the Model
 Save the entire model to an HDF5 (Hierarchical Data Format version 5) file.

 The .h5 extension of the file indicates that the model should be **saved in Keras format as an HDF5 file**.

In [None]:
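from tensorflow.keras.models import load_model  # Assumes tf.keras; skip if already imported

model_name = 'digits_recognition_mlp.h5'
model.save(model_name, save_format='h5')  # `model` is the trained model from the earlier steps
loaded_model = load_model(model_name)     # Reload the saved model from disk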

#Analysis on MLP

##Problem: Vanishing Gradient and Exploding Gradient
**Vanishing Gradient:** the gradient becomes very small, so updating is very slow.

**Exploding Gradient:** the gradient becomes very large, and the updated weights may become NaN.

**For vanishing gradient:**
1. The parameters of the higher layers vary dramatically, whereas the parameters of the lower layers change very little (or not at all).
2. During training, the model weights may become zero.
3. The model learns slowly, and after a few cycles, the training may become stagnant.

**For exploding gradient:**
1. The model parameters are growing exponentially.
2. During training, the model weights may become NaN.
3. The model goes through an avalanche learning process.

## Problem: Overfitting and Underfitting
**Overfitting:**
 It refers to a model that models the training data too well. It happens when a model **learns the detail and noise in the training data** to the extent that it negatively impacts the performance of the model on the new data.

**Underfitting:**
 It refers to a model that can neither model the training data nor generalize to new data.

**Avoiding overfitting is an important research direction for MLPs.**

##How Many Layers and Number of Neurons in Each of These Layers?

**Rules of thumb:**

**Number of layers:**
1. If our data is linearly separable, NO hidden layer at all.
2. If data is less complex and has few dimensions or features,
   neural networks with 1 to 2 hidden layers would work.
3. If data has large dimensions or features, 3 to 5 hidden layers
   can be used to get an optimum solution.

**Number of neurons:**
1. The number of hidden neurons should be between the size of the input layer and the output layer.
2. The most appropriate number of hidden neurons is:
\begin{align}
 \sqrt{input \times output}
\end{align}
3. The number of hidden neurons should keep decreasing in subsequent layers, so that the network moves progressively closer to pattern and feature extraction and to identifying the target class.
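The square-root rule is easy to evaluate. The sketch below applies it to a 784-input, 10-output problem; these sizes and the helper name `suggested_hidden_neurons` are assumptions for illustration, and the result is only a starting point for experimentation.

```python
import math

def suggested_hidden_neurons(n_inputs, n_outputs):
    """Rule-of-thumb hidden layer size: sqrt(inputs * outputs), rounded."""
    return round(math.sqrt(n_inputs * n_outputs))

n_hidden = suggested_hidden_neurons(784, 10)   # sqrt(7840) is about 88.5
print(n_hidden)  # 89
```

The value also satisfies the first rule, since it lies between the output size (10) and the input size (784).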

##The effect of weights and bias

Suppose we have the following perceptron:
\begin{align}
 x \rightarrow {Node} \rightarrow f(\omega x + \theta)
\end{align}

Let’s get the output functions by setting w to 0.5, 1, 2, 3, θ to 0, and using sigmoid activation function.

According to the example, we can see that **weights control the steepness of the activation function**.

Now, let’s get another set of output functions by setting w to 1, θ to 0, 1, 2, and 3, and using the sigmoid activation function.

According to the example, we can see that bias is used for shifting the activation function towards the left or right.
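Both effects can be seen numerically without a plot. The following sketch uses the same values of w and θ as the two experiments above; the function name `neuron` and the probe point x = 1 are illustrative choices.

```python
import math

def neuron(x, w, theta):
    """Output of a single sigmoid node: f(w * x + theta)."""
    return 1.0 / (1.0 + math.exp(-(w * x + theta)))

# Effect of the weight (theta = 0): the output at x = 1 moves further
# from 0.5 as w grows, i.e. the curve gets steeper around x = 0.
outputs_by_w = [neuron(1.0, w, 0.0) for w in (0.5, 1.0, 2.0, 3.0)]

# Effect of the bias (w = 1): the output equals 0.5 where w * x + theta = 0,
# so the midpoint of the curve sits at x = -theta / w and shifts left.
midpoints = [-theta / 1.0 for theta in (0.0, 1.0, 2.0, 3.0)]

for theta, x_mid in zip((0.0, 1.0, 2.0, 3.0), midpoints):
    assert neuron(x_mid, 1.0, theta) == 0.5
```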

#Remark:

**Build the Model**

Layers

 Layer 1: Flatten layer that will flatten 2D image into 1D

 Layer 2: Hidden Dense layer 1 with 128 neurons and ReLU activation

 Layer 3: Hidden Dense layer 2 with 128 neurons and ReLU activation

 Layer 4: Output Dense layer with 10 Softmax outputs.

##ReLU Activation Function:
$f(x) = max (0, x)$

ReLU is differentiable everywhere except at $x = 0$, which allows for backpropagation, and it is computationally efficient.

The neurons will only be deactivated if the output of the linear transformation is less than 0.

Accelerates the convergence of gradient descent due to its linear property.

Limitations: **Dying ReLU problem**

**The Dying ReLU Problem**

The negative side of the graph makes the gradient value zero. During the backpropagation process, the weights and biases for some neurons are not updated.

This can create dead neurons, which never get activated. All negative input values become zero immediately, which decreases the model's ability to fit or train on the data properly.
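A small sketch of ReLU, its gradient, and a dead neuron. The weight, bias, and samples are hypothetical values chosen so that the pre-activation is negative for every sample; the gradient at exactly 0 is taken to be 0 here, which is a common convention rather than something stated in the notes.

```python
def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

# Hypothetical neuron whose linear output is negative for every sample.
w, theta = -1.0, -0.5
samples = [0.2, 1.0, 3.0]

outputs = [relu(w * x + theta) for x in samples]
# Gradient of the output with respect to w: relu'(w*x + theta) * x.
weight_grads = [relu_grad(w * x + theta) * x for x in samples]

print(outputs)       # [0.0, 0.0, 0.0]
print(weight_grads)  # [0.0, 0.0, 0.0]
```

Because every gradient is zero, no update ever changes `w` or `theta`, so this neuron stays inactive for these inputs.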

##Softmax Function:
 $f(z_{i}) = \frac{e^{z_{i}}}{\sum_{j} e^{z_{j}}}$

Since $\sum_{i} f(z_{i}) = 1$ and every $f(z_{i}) > 0$, the outputs form a probability distribution.

**Description:**

1. A combination of multiple sigmoids.
2. Calculates the relative probabilities. Similar to the sigmoid/logistic activation function, the Softmax function **returns the probability of each class**.
3. Most commonly used as an **activation function for the output layer of the neural network** in the case of multi-class classification.
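A minimal Softmax sketch follows. Subtracting the maximum before exponentiating is a standard stability trick added here as an assumption; it does not change the result, because the common factor cancels in the ratio. The example scores are made up.

```python
import math

def softmax(z):
    """Softmax of a list of scores, shifted by max(z) to avoid overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# The outputs are positive, sum to 1, and keep the ordering of the scores.
assert abs(sum(probs) - 1.0) < 1e-12
assert probs[0] > probs[1] > probs[2]
```

The predicted class is the index of the largest probability, here index 0.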

##Sigmoid Function:
$f(x) = \frac{1}{1 + e^{-x}}$

**Limitation:** The output of the sigmoid function is not symmetric around zero. So the output of all the neurons will be of the **same sign**. This makes the training of the neural network more difficult and unstable.

##Tanh Activation Function:
$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

Very similar to the sigmoid activation function, with the same S-shape; the difference is the output range of -1 to 1.

The output of the tanh activation function is **zero-centered**; hence we can easily map the output values as strongly negative, neutral, or strongly positive.

When used in hidden layers, the mean of the hidden layer's output comes out to be 0 or very close to it. This helps in centering the data and makes learning for the next layer much easier.

##Why Are Sigmoid and Tanh More Susceptible to Vanishing Gradients?

The gradient values are only significant for range -3 to 3, and the graph gets much flatter in other regions.

**For values greater than 3 or less than -3, the function will have very small gradients**. As the gradient value approaches zero, the network ceases to learn and suffers from vanishing gradients.
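The size of these gradients can be computed directly. The sketch below evaluates the sigmoid and tanh derivatives at a few points and shows how a product of sigmoid derivatives shrinks with depth. The depth of 5 is an arbitrary choice, and the bound ignores the weights, so it only illustrates the trend.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 3.0, 6.0):
    print(f"x={x}: sigmoid'={sigmoid_grad(x):.5f}, tanh'={tanh_grad(x):.5f}")

# The sigmoid derivative is at most 0.25 (reached at x = 0), so the product
# of activation derivatives across 5 sigmoid layers is at most 0.25 ** 5.
depth = 5
upper_bound = sigmoid_grad(0.0) ** depth
```

Each additional sigmoid layer multiplies the backpropagated signal by a factor of at most 0.25, which is why the lower layers learn slowly.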

#How to choose a hidden layer Activation function?

| Neural Network | Commonly Used Activation Function |
| :------------  | :----------------                 |
| Multi-layer Perceptron (MLP) | ReLU activation function |
| Convolutional Neural Network (CNN) | ReLU activation function |
| Recurrent Neural Network (RNN) | Tanh and/or Sigmoid activation function|

#How to choose an Output Activation function?
There are three commonly used activation functions for use in the output layer.
1. Linear
2. Sigmoid/Logistic
3. Softmax

If a problem is a **regression problem**, we should use a **linear activation function**.

If a problem is a **classification problem**, then there are three main types of classification problems, and each may use a different activation function.
1. **Binary classification:** One node, **sigmoid activation**.
2. **Multiclass classification:** One node per class, **softmax activation**.
3. **Multilabel classification:** One node per class, **sigmoid activation**.

**Note:** Multiclass classification makes the assumption that each sample is assigned to one and only one label. Multilabel classification assigns to each sample a set of target labels.

##Backward propagation

To illustrate the basic idea, consider the network structure below.

Given that $a_i = f_i(w_i a_{i-1} + \theta_i)$ where $a_0 = x$, $w_i$ is the weight parameter, $\theta_i$ is the bias, and $f_i$ is the activation function of layer $i$ in the network.

Network: <br>
input $x$  -($w_1$)-> $a_1,f_1$ -($w_2$)-> $a_2, f_2$ -($w_3$)-> $a_3, f_3$ --> $y$ --> Loss function $L$

Let's say we now want to update $w_2$ through backpropagation. Then we will need to compute $w_2 = w_2 - \eta \nabla w_2$

$\nabla w_2 = \frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial f_3}\frac{\partial f_3}{\partial a_3}\frac{\partial a_3}{\partial f_2}\frac{\partial f_2}{\partial a_2}\frac{\partial a_2}{\partial w_2}$

Similarly, $\nabla w_3 = \frac{\partial L}{\partial w_3} = \frac{\partial L}{\partial f_3}\frac{\partial f_3}{\partial a_3}\frac{\partial a_3}{\partial w_3}$. The first two factors, $\frac{\partial L}{\partial f_3}\frac{\partial f_3}{\partial a_3}$, appear in both expressions, so they can be computed once and reused.
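The reuse of intermediate results can be made concrete on a scalar chain of three layers. The sketch below is written under several assumptions that are not in the notes: every $f_i$ is a sigmoid, the loss is $\frac{1}{2}(a_3 - y)^2$, and $z_i$ denotes the input to $f_i$, i.e. $z_i = w_i a_{i-1} + \theta_i$. All numeric values are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w, theta):
    """Return the activations [a_0, a_1, a_2, a_3] of a scalar 3-layer chain."""
    a = [x]
    for w_i, t_i in zip(w, theta):
        a.append(sigmoid(w_i * a[-1] + t_i))
    return a

def loss(x, y, w, theta):
    return 0.5 * (forward(x, w, theta)[-1] - y) ** 2

x, y = 0.5, 1.0
w = [0.8, -1.2, 0.6]        # w_1, w_2, w_3 (hypothetical values)
theta = [0.1, 0.0, -0.3]    # theta_1, theta_2, theta_3 (hypothetical values)

a = forward(x, w, theta)

# Shared factor dL/dz_3, computed once.
d3 = (a[3] - y) * a[3] * (1.0 - a[3])
grad_w3 = d3 * a[2]

# dL/dz_2 reuses d3 instead of recomputing the output-side factors.
d2 = d3 * w[2] * a[2] * (1.0 - a[2])
grad_w2 = d2 * a[1]

def numeric_grad(index, h=1e-6):
    """Central finite-difference estimate of dL/dw at position `index`."""
    up = w[:]
    up[index] += h
    down = w[:]
    down[index] -= h
    return (loss(x, y, up, theta) - loss(x, y, down, theta)) / (2.0 * h)

assert abs(grad_w3 - numeric_grad(2)) < 1e-8
assert abs(grad_w2 - numeric_grad(1)) < 1e-8
```

Continuing one more step, the gradient for $w_1$ would reuse `d2` in the same way, which is why backpropagation proceeds from the output layer toward the input.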

The following provides a simple animation showing how gradient descent works on regressing a basic function $3x^2$.
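As a text-only sketch of the same idea, the code below runs gradient descent on $f(x) = 3x^2$. It assumes the goal is to minimize this function starting from an arbitrary point; the starting value, learning rate, and number of steps are illustrative choices.

```python
def f(x):
    """The function being minimized."""
    return 3.0 * x ** 2

def grad_f(x):
    """Its derivative: f'(x) = 6x."""
    return 6.0 * x

def gradient_descent(x0, eta=0.1, steps=20):
    """Apply x <- x - eta * f'(x) repeatedly and record every iterate."""
    x = x0
    history = [x]
    for _ in range(steps):
        x = x - eta * grad_f(x)
        history.append(x)
    return history

history = gradient_descent(4.0)
for step in (0, 1, 2, 5, 20):
    print(f"step {step}: x = {history[step]:.8f}, f(x) = {f(history[step]):.8f}")
```

With $\eta = 0.1$ each step multiplies $x$ by $1 - 6\eta = 0.4$, so the iterates shrink geometrically toward the minimum at $x = 0$. A learning rate above $1/3$ would make the factor larger than 1 in magnitude, and the iterates would diverge.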