---
## Perceptron and Multilayer Neural Network
#### Language: Python 3.8.8
#### Author: Tianjian Sun
---

### Table of Contents
- [Introduction](#Introduction)
- [Algorithm](#Algorithm)
    - [Perceptron](#Perceptron)
    - [Multilayer Neural Network](#Multilayer)
- [Illustration](#Illustration)
- [Advantages and Disadvantages](#Advantages)
    - Advantages of Perceptron
    - Disadvantages of Perceptron
    - Advantages of Multilayer Neural Network
    - Disadvantages of Multilayer Neural Network
- [Code and Applications on data sets](#Applications)
    - [Classification problem](#Classification)
    - [Regression problem](#Regression)
--- 

### Introduction <a class="anchor" id="Introduction"></a>

A [Perceptron](https://en.wikipedia.org/wiki/Perceptron) is an algorithm for supervised learning of binary classifiers. It's a simplified model of a biological neuron, and it is a type of linear classifier, i.e., a classifier that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.

The perceptron algorithm was invented in 1958 at the Cornell Aeronautical Laboratory by [Frank Rosenblatt](https://en.wikipedia.org/wiki/Frank_Rosenblatt). It is a precursor of both support vector machines and multilayer neural networks.


A [multilayer neural network](https://en.wikipedia.org/wiki/Multilayer_perceptron) (or multilayer perceptron, MLP) is a class of feedforward artificial neural network (ANN). It extends the perceptron with additional layers and non-linear activation functions.

---

### Algorithm <a class="anchor" id="Algorithm"></a>

#### Perceptron <a class="anchor" id="Perceptron"></a>
A perceptron is a function that maps its input $\mathbf{x}$ (a real-valued vector) to an output value $f(\mathbf{x})$ (a single binary value), given by

$$
f(\mathbf{x}) = \begin{cases}1 & \text{if }\ \mathbf{w} \cdot \mathbf{x} + b > 0,\\0 & \text{otherwise}\end{cases}
$$

where $\mathbf{w}$ is a vector of real-valued weights, $b$ is the bias, $\mathbf{w} \cdot \mathbf{x} = \sum_{i=1}^{m} {w_{i}x_{i}}$ is the dot product, and $m$ is the number of inputs to the perceptron.
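
As a minimal sketch of the decision rule above, the function below evaluates $f(\mathbf{x})$ with NumPy. The name `perceptron_predict` and the example weights (chosen by hand so that the perceptron behaves like a logical AND) are illustrative assumptions, not part of the original material.

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Return 1 if w . x + b > 0, otherwise 0 (the step rule above)."""
    return int(np.dot(w, x) + b > 0)

# Hypothetical hand-picked parameters: fires only when both inputs are 1.
w = np.array([1.0, 1.0])
b = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron_predict(np.array(x), w, b) for x in inputs]
print(outputs)
```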

The learning algorithm is:
1. Initialize the weights to small random values.
2. For each training sample $\mathbf{x}_j$ with target output $d_j$, perform the following steps:
   - Calculate the actual output:
$$
y_j(t) = f[\mathbf{w}(t)\cdot\mathbf{x}_j]
$$
   - Update the weights:
$$
\mathbf{w}(t+1) = \mathbf{w}(t) + \gamma\,(d_j - y_j(t))\,\mathbf{x}_j
$$

   Here $\gamma$ is the learning rate, and the bias $b$ is treated as an extra weight on a constant input of 1.

3. Repeat step 2 until a stopping criterion is met, for example when no sample is misclassified or a maximum number of iterations is reached.
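
The steps above can be sketched as a short training loop. This is only an illustration under stated assumptions: the function name `train_perceptron`, the learning rate `gamma`, the epoch limit `max_epochs`, and the toy AND data set are all hypothetical choices, and the stopping criterion used here is "no sample misclassified in a full pass".

```python
import numpy as np

def train_perceptron(X, d, gamma=0.1, max_epochs=100, seed=0):
    """Train a perceptron on samples X (rows) with binary targets d."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])  # step 1: small random weights
    b = 0.0                                      # bias, updated like a weight
    for _ in range(max_epochs):
        errors = 0
        for x_j, d_j in zip(X, d):               # step 2: loop over samples
            y_j = int(np.dot(w, x_j) + b > 0)    # actual output
            update = gamma * (d_j - y_j)         # zero when the sample is correct
            w += update * x_j
            b += update
            errors += int(update != 0)
        if errors == 0:                          # step 3: stopping criterion
            break
    return w, b

# Toy, linearly separable data (logical AND); chosen only for illustration.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, d)
predictions = [int(np.dot(w, x) + b > 0) for x in X]
print(predictions)
```

Because the AND data is linearly separable, the loop stops once every sample is classified correctly; on data that is not linearly separable it would instead run until `max_epochs`.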

#### Multilayer Neural Network <a class="anchor" id="Multilayer"></a>

A multilayer neural network usually uses a non-linear activation function. The *sigmoid* ($\sigma(\cdot)$), *hyperbolic tangent* ($\tanh(\cdot)$) and *rectified linear unit* ($\mathrm{ReLU}(\cdot)$) activation functions are widely used:
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

$$
\tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}
$$

$$
\mathrm{ReLU}(x) = \begin{cases} x & \text{if}\ x > 0,\\0 & \text{otherwise}\end{cases}
$$
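
For reference, the three activation functions can be written directly from their definitions with NumPy. This is a plain transcription for illustration; the function names are assumptions, and in practice `np.tanh` is the numerically safer choice for large inputs.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent, written out from its definition."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    """Rectified linear unit: x if x > 0, otherwise 0."""
    return np.maximum(x, 0.0)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```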

An MLP consists of several layers of nodes with non-linear activations: an input layer, an output layer, and one or more hidden layers in between. Since MLPs are fully connected, each node in one layer connects with a certain weight to every node in the following layer.
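
A forward pass through such a fully connected network can be sketched as repeated "weighted sum, then activation" steps. The layer sizes, the random weights, the use of bias vectors, and the choice of `np.tanh` for every layer below are assumptions made only for this example.

```python
import numpy as np

def forward(x, weights, biases, activation=np.tanh):
    """Propagate x through fully connected layers; return every layer's output."""
    outputs = [x]
    for W, b in zip(weights, biases):
        # Each row of W holds the weights from all nodes of the previous layer
        # to one node of the current layer.
        outputs.append(activation(W @ outputs[-1] + b))
    return outputs

rng = np.random.default_rng(0)
sizes = [3, 4, 2]  # hypothetical network: 3 inputs, 4 hidden nodes, 2 outputs
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

layer_outputs = forward(np.array([0.5, -1.0, 2.0]), weights, biases)
print([y.shape for y in layer_outputs])
```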

Learning in an MLP proceeds by changing the connection weights after each piece of data is processed, based on the error between the output and the expected result. This is supervised learning, carried out through [backpropagation](https://en.wikipedia.org/wiki/Backpropagation), a generalization of the least mean squares algorithm in the linear perceptron.

The basic idea of backpropagation is to use the [chain rule](https://en.wikipedia.org/wiki/Chain_rule) to compute the derivatives for each layer, working backwards from the output layer to the input layer.

The error in output node $j$ for the $n$-th training example is $e_j(n)=d_j(n)-y_j(n)$, where $d$ is the target value and $y$ is the value produced by the network. The node weights can be adjusted to minimize the sum of squared errors over the entire output, given by

$$\mathcal{E}(n)=\frac{1}{2}\sum_j e_j^2(n)$$

Using gradient descent, the change in each weight $w_{ji}$ is

$$\Delta w_{ji} (n) = -\gamma\frac{\partial\mathcal{E}(n)}{\partial v_j(n)} y_i(n)$$

where $y_i$ is the output of the previous neuron $i$, $\gamma$ is the *learning rate*, and $v_j$ is the weighted sum of the inputs to neuron $j$, that is, its value before the activation function is applied.

It is easy to prove that for an output node this derivative can be simplified to

$$-\frac{\partial\mathcal{E}(n)}{\partial v_j(n)} = e_j(n)\phi^\prime (v_j(n))$$

where $\phi^\prime$ is the derivative of the activation function described above, which itself does not vary. For hidden nodes, it can be shown that the relevant derivative is

$$-\frac{\partial\mathcal{E}(n)}{\partial v_j(n)} = \phi^\prime (v_j(n))\sum_k -\frac{\partial\mathcal{E}(n)}{\partial v_k(n)} w_{kj}(n)$$
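
The two derivative formulas above can be sketched in code for a small network with one hidden layer. In this sketch `delta` stands for $-\partial\mathcal{E}/\partial v$, the sum over $k$ runs over the nodes of the following layer, and the sigmoid is used so that $\phi^\prime(v) = y(1-y)$. The network shape, the omission of biases, the learning rate, and the single training sample are all assumptions chosen to keep the example short.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(x, d, W1, W2, gamma=0.5):
    """One gradient-descent update for a bias-free network with one hidden layer."""
    # Forward pass: v is the weighted input sum, y the activation.
    v1 = W1 @ x
    y1 = sigmoid(v1)
    v2 = W2 @ y1
    y2 = sigmoid(v2)

    e = d - y2                                  # e_j = d_j - y_j
    delta2 = e * y2 * (1.0 - y2)                # output nodes: e_j * phi'(v_j)
    delta1 = y1 * (1.0 - y1) * (W2.T @ delta2)  # hidden: phi'(v_j) * sum_k delta_k w_kj

    # Weight change: gamma * delta_j * y_i, with y_i the previous layer's output.
    W2_new = W2 + gamma * np.outer(delta2, y1)
    W1_new = W1 + gamma * np.outer(delta1, x)
    return W1_new, W2_new, 0.5 * np.sum(e ** 2)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 2))  # hypothetical: 2 inputs -> 3 hidden nodes
W2 = rng.normal(scale=0.5, size=(1, 3))  # 3 hidden nodes -> 1 output
x, d = np.array([1.0, 0.0]), np.array([1.0])

errors = []
for _ in range(50):
    W1, W2, err = backprop_step(x, d, W1, W2)
    errors.append(err)
print(errors[0], errors[-1])
```

Repeating the update on the same sample should drive the squared error down, which is a quick sanity check that the signs in the update match the formulas.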

---
### Illustration <a class="anchor" id="Illustration"></a>

**Perceptron**

In one sentence, a perceptron is a simplified model of a neuron. It consists of four parts: inputs, weights, a net sum, and an activation function. The following image is from [Wikipedia](https://en.wikipedia.org/wiki/Perceptron#/media/File:Perceptron.svg).

<img src="images/Perceptron.svg" alt="drawing" width="400"/>

**Multilayer Neural Network**

A multilayer neural network is made of multiple layers of perceptron-like nodes. It consists of an input layer, one or more hidden layers, an output layer, fully connected weights and non-linear activation functions. The following image is from [Wikipedia](https://commons.wikimedia.org/wiki/File:MultiLayerPerceptron.png).

<img src="images/Multilayer-Perceptron-Network.png" alt="drawing" width="400"/>

---
### Advantages and Disadvantages <a class="anchor" id="Advantages"></a>
#### Advantages of Perceptron
- Easy to use and understand.
- Simple
#### Disadvantages of Perceptron
- Cannot learn a non-linear decision boundary
- Output can take only one of two values

#### Advantages of Multilayer Neural Network
- Can learn non-linear and complex relationships.
- Can approximate any continuous function on a bounded domain, given at least one hidden layer with enough neurons (universal approximation theorem).
- Can work on both classification and regression
#### Disadvantages of Multilayer Neural Network
- Many hyperparameters to tune
- High model complexity when the number of layers and neurons is large
- Slow training when the model is large
- Prone to overfitting