# Multi-layer perceptrons

A perceptron produces a single output from several real-valued inputs by forming a linear combination of the inputs with its weights and passing the result through a nonlinear activation function. Inside a network, its inputs are the outputs of the nodes in the previous layer, so the perceptron computes
$y=\delta \left ( \sum_{i=0}^{n} w_ix_i \right ) = \delta \left ( w^Tx \right )$ where `w` denotes the vector of weights, `x` is the vector of inputs, $x_0 = 1$, $w_0$ is the bias weight, and $\delta$ is the nonlinear activation function.
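
The formula above can be sketched in a few lines of plain Python. This is only an illustration: the logistic sigmoid is an assumed choice for $\delta$, and the function names and numeric values are hypothetical.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic sigmoid, one possible choice for the activation delta."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron(weights: Sequence[float], inputs: Sequence[float]) -> float:
    """Compute delta(w^T x), with x_0 = 1 prepended so weights[0] is the bias weight."""
    x = [1.0] + list(inputs)
    if len(weights) != len(x):
        raise ValueError("expected one weight per input plus one bias weight")
    z = sum(w_i * x_i for w_i, x_i in zip(weights, x))
    return sigmoid(z)


# Hypothetical weights and inputs: z = 0*1 + 1*2 + (-1)*2 = 0, so y = sigmoid(0) = 0.5
y = perceptron([0.0, 1.0, -1.0], [2.0, 2.0])
```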

A multilayer perceptron (MLP) is a feedforward artificial neural network composed of more than one perceptron. It consists of an input layer that receives the signal, an output layer that makes a decision or prediction about the input, and, between the two, an arbitrary number of hidden layers that are the true computational engine of the MLP. Given enough hidden nodes and a suitable nonlinear activation function, an MLP with a single hidden layer can approximate any continuous function on a closed and bounded input region to arbitrary accuracy.

Multilayer perceptrons are often applied to supervised learning problems: they train on a set of input-output pairs and learn to model the dependencies between those inputs and outputs. Training involves adjusting the parameters of the model, that is, its weights and biases, in order to minimize the error. Backpropagation is used to make those weight and bias adjustments relative to the error, and the error itself can be measured in a variety of ways, including root mean squared error (RMSE).
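
As a small illustration of one such error measure, the sketch below computes RMSE for a list of predictions against their targets. The function name and the sample numbers are hypothetical.

```python
import math
from typing import Sequence


def rmse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Root mean squared error: sqrt(mean((prediction - target)^2))."""
    if len(predictions) != len(targets) or len(targets) == 0:
        raise ValueError("predictions and targets must be non-empty and equally long")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: errors are 1 and 3, so RMSE = sqrt((1 + 9) / 2) = sqrt(5)
error = rmse([2.0, 4.0], [1.0, 1.0])
```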

In the forward pass, the signal flows from the input layer through the hidden layers to the output layer, and the prediction of the output layer is compared against the ground-truth labels.
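
A forward pass can be sketched as a loop over layers, where each node applies the perceptron rule to the outputs of the previous layer. The weights, the target, and the choice of activations here (ReLU, i.e. `max(0, z)`, in the hidden layer and the identity at the output) are assumptions made only for this example.

```python
from typing import Callable, List, Sequence

Layer = List[List[float]]  # one row of weights per node; row[0] is the bias weight


def relu(z: float) -> float:
    return max(0.0, z)


def identity(z: float) -> float:
    return z


def forward(layers: Sequence[Layer],
            activations: Sequence[Callable[[float], float]],
            inputs: Sequence[float]) -> List[float]:
    """Propagate the inputs through every layer and return the output layer's values."""
    signal = list(inputs)
    for layer, activation in zip(layers, activations):
        x = [1.0] + signal  # prepend the constant bias input x_0 = 1
        signal = [activation(sum(w * x_i for w, x_i in zip(row, x))) for row in layer]
    return signal


# Hypothetical 2-2-1 network.
hidden = [[0.0, 1.0, 1.0], [0.0, 1.0, -1.0]]
output = [[0.5, 1.0, 2.0]]
prediction = forward([hidden, output], [relu, identity], [3.0, 1.0])

# Compare the prediction with a (hypothetical) ground-truth label.
target = 9.0
error = prediction[0] - target
```

With these numbers the hidden nodes output 4 and 2, the prediction is 0.5 + 4 + 2·2 = 8.5, and the error against the target is -0.5.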

In the backward pass, backpropagation uses the chain rule of calculus to propagate the partial derivatives of the error function w.r.t. the various weights and biases backwards through the MLP. This differentiation yields the gradient of the error with respect to the parameters; adjusting the parameters a small step against the gradient moves the MLP one step closer to an error minimum. The update can be performed by any gradient-based optimisation algorithm, such as stochastic gradient descent.
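
To make the chain rule concrete, here is a minimal sketch for a tiny network with one input, one hidden node and one output node. Everything specific in it is an assumption chosen for illustration: the `tanh` hidden activation, the linear output, the squared-error loss, the learning rate, and the starting values.

```python
import math
from typing import List, Tuple


def forward(params: List[float], x: float) -> Tuple[float, float]:
    """Forward pass: h = tanh(w1*x + b1), y = w2*h + b2."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    y = w2 * h + b2
    return h, y


def loss(params: List[float], x: float, target: float) -> float:
    """Squared error 0.5 * (y - target)^2 for a single training pair."""
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2


def gradients(params: List[float], x: float, target: float) -> List[float]:
    """Backward pass: apply the chain rule from the output back to the input."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    d_y = y - target              # dL/dy
    d_w2 = d_y * h                # dL/dw2
    d_b2 = d_y                    # dL/db2
    d_h = d_y * w2                # dL/dh
    d_z1 = d_h * (1.0 - h * h)    # dL/dz1, using tanh'(z) = 1 - tanh(z)^2
    d_w1 = d_z1 * x               # dL/dw1
    d_b1 = d_z1                   # dL/db1
    return [d_w1, d_b1, d_w2, d_b2]


def sgd_step(params: List[float], grads: List[float], learning_rate: float) -> List[float]:
    """Move every parameter a small step against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Hypothetical starting point and training pair.
params = [0.5, 0.0, 0.5, 0.0]
x, target = 1.0, 0.5
initial_loss = loss(params, x, target)
for _ in range(100):
    params = sgd_step(params, gradients(params, x, target), learning_rate=0.1)
final_loss = loss(params, x, target)
```

Each line of `gradients` reuses the derivative computed on the line before it, which is what lets backpropagation cover all parameters in a single backward sweep.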

#### Examples of MLPs

1) A multi-layer perceptron that has one hidden layer.
 - Input layer has 2 nodes
 - Hidden layer has 2 nodes
 - Output layer has 1 node
 
<img src="images/fig1.png" width="500">

 
2) A multi-layer perceptron that has two hidden layers.
 - Input layer has 2 nodes
 - Hidden layer 1 has 2 nodes
 - Hidden layer 2 has 3 nodes
 - Output layer has 1 node

<img src="images/fig2.png" width="600">
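
The layer sizes listed above determine how many trainable parameters a network has. The sketch below assumes fully connected layers in which every non-input node has one weight per node of the previous layer plus one bias weight, as in the perceptron formula; the function name is hypothetical.

```python
from typing import Sequence


def count_parameters(layer_sizes: Sequence[int]) -> int:
    """Count weights and biases of a fully connected MLP given its layer sizes."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


one_hidden_layer = count_parameters([2, 2, 1])      # (2+1)*2 + (2+1)*1
two_hidden_layers = count_parameters([2, 2, 3, 1])  # (2+1)*2 + (2+1)*3 + (3+1)*1
```

Under that assumption the first example has 9 parameters and the second has 19.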

3) Suppose we need to design an MLP that predicts house prices from two features: the floor area ($m^2$) and the distance to the city centre (km).

From the problem statement, we can determine that the input layer has 2 nodes and the output layer has 1 node. Suppose we want the network to have one hidden layer of 2 nodes; the network for this problem then has the following structure:

<img src="images/fig3.png" width="600">

The two values $a_1$ and $a_2$ are computed as follows:

$$
\begin{aligned}
a_1 &= \delta \left ( x_1  \times w_1  + x_2  \times w_3  + x_3  \times w_5 \right )\\
a_2 &= \delta \left ( x_1  \times w_2  + x_2  \times w_4  + x_3  \times w_6 \right )
\end{aligned}
$$

To see the computation steps for $a_1$ and $a_2$ more clearly, the network can be redrawn in more detail as follows:

<img src="images/fig4.png" width="600">

Now,
$$
\begin{aligned}
z_1 &= x_1  \times w_1  + x_2  \times w_3  + x_3  \times w_5\\
z_2 &= x_1  \times w_2  + x_2  \times w_4  + x_3  \times w_6\\
a_1 &= \delta \left ( z_1 \right )\\
a_2 &= \delta \left ( z_2 \right )
\end{aligned}
$$

The activation function $\delta \left ( \cdot \right )$ most commonly used today is ReLU, defined as `y = max(0, x)`. ReLU returns `x` itself when `x` is greater than `0`, and returns `0` otherwise. The ReLU function is sketched below:

<img src="images/fig5.png" width="600">
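
The definition `y = max(0, x)` translates directly into code. The sample inputs below are arbitrary values chosen to show both branches.

```python
def relu(x: float) -> float:
    """Return x when x is greater than 0, and 0 otherwise."""
    return max(0.0, x)


samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in samples]  # negative inputs are clipped to 0.0
```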

With ReLU chosen as the activation function, $a_1$ and $a_2$ are computed as follows:

$$
\begin{aligned}
a_1 &= \mathrm{ReLU} \left ( z_1 \right )\\
a_2 &= \mathrm{ReLU} \left ( z_2 \right )
\end{aligned}
$$
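
The hidden-layer computation can be sketched with the same weight indexing as the equations for $z_1$ and $z_2$. The numeric values are hypothetical, and treating $x_3$ as a constant bias input equal to 1 is an assumption of this example rather than something stated in the text.

```python
from typing import Sequence, Tuple


def relu(z: float) -> float:
    return max(0.0, z)


def hidden_layer(x: Sequence[float], w: Sequence[float]) -> Tuple[float, float]:
    """Compute a_1 and a_2 from inputs (x1, x2, x3) and weights (w1, ..., w6)."""
    x1, x2, x3 = x
    w1, w2, w3, w4, w5, w6 = w
    z1 = x1 * w1 + x2 * w3 + x3 * w5
    z2 = x1 * w2 + x2 * w4 + x3 * w6
    return relu(z1), relu(z2)


# Hypothetical values; x3 = 1.0 is ASSUMED to be the constant bias input.
a1, a2 = hidden_layer((2.0, 1.0, 1.0), (1.0, -1.0, 0.5, 0.5, -1.0, 0.0))
```

Here $z_1 = 1.5$ and $z_2 = -1.5$, so ReLU passes the first value through unchanged and clips the second to 0.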

The network we designed has 9 parameters. When the network is initialized, these 9 parameters