# Neural Networks & Gradient Descent Algorithm Tutorial

**Author:** ClaemWu  
**Affiliation:** HKBU AEF MscDABE  

## Overview

This tutorial provides detailed examples explaining the concepts and working principles of neural networks, as well as the derivation of gradient descent.

**Note:** This tutorial is intended for learning and reference purposes only and does not carry any academic credit or certification.

# Multi-layer Perceptron (MLP): Feedforward Artificial Neural Network (NN)

The **Multilayer Perceptron (MLP)** is one of the simplest and most fundamental neural network architectures. It is often regarded as the **building block** of modern deep learning: more complex models extend its layered structure.

It comprises:
- An **input layer** that receives the initial data.
- One or more **hidden layers** that perform the core computation.
- An **output layer** that produces the final prediction or classification.

# Neural Network Algorithm Implementation

<table>
<tr>
<td width="50%" valign="top">

![Network Architecture](3_layerNN.png)
*3-Layer Neural Network*

</td>
<td width="50%" valign="top">

### Matrix Dimension Flow
The dimensions ensure compatible operations in the forward pass:

$$
\begin{aligned}
\text{Input:} & \quad \mathbf{X} && (m, 2) \\
\text{Layer 1:} & \quad \mathbf{Z}^{[1]} = \mathbf{X} (\mathbf{W}^{(1)})^\top && \rightarrow (m, 3) \\
& \quad \mathbf{A}^{[1]} = \sigma(\mathbf{Z}^{[1]}) && \rightarrow (m, 3) \\
\text{Layer 2:} & \quad \mathbf{Z}^{[2]} = \mathbf{A}^{[1]} (\mathbf{W}^{(2)})^\top && \rightarrow (m, 3) \\
& \quad \mathbf{A}^{[2]} = \sigma(\mathbf{Z}^{[2]}) && \rightarrow (m, 3) \\
\text{Output:} & \quad \mathbf{Z}^{[3]} = \mathbf{A}^{[2]} (\mathbf{W}^{(3)})^\top && \rightarrow (m, 1) \\
& \quad \mathbf{A}^{[last]} = \mathbf{\hat{Y}} = \sigma(\mathbf{Z}^{[3]}) && \rightarrow (m, 1)
\end{aligned}
$$

</td>
</tr>
</table>

## Model Parameter Specification

This section defines the key weight matrices for the 3-layer neural network.

**Layer 1 (Input to Hidden Layer):**
*   **Weight Matrix:** $\mathbf{W}^{(1)}$
*   **Dimensions:** $3 \times 2$
*   **Description:** This matrix connects the $2$ input neurons to the $3$ neurons in the first hidden layer. Each element $w^{(1)}_{ij}$ represents the weight from input neuron $j \in \{1, 2\}$ to hidden neuron $i \in \{1, 2, 3\}$.

*   **Activation Function (Sigmoid):**
    $$
    \mathbf{A}^{[1]} = \sigma(\mathbf{Z}^{[1]}) = \frac{1}{1 + e^{-\mathbf{Z}^{[1]}}}
    $$
    The sigmoid function transforms each element $z^{[1]}_{ij}$ of the linear output $\mathbf{Z}^{[1]}$ into an activation value $a^{[1]}_{ij}$ between 0 and 1. This introduces non-linearity, allowing the network to learn complex patterns. $\mathbf{A}^{[1]}$ has shape $(m, 3)$.

**Layer 2 (Hidden to Hidden):**
*   **Weight Matrix:** $\mathbf{W}^{(2)}$
*   **Dimensions:** $3 \times 3$
*   **Description:** This matrix connects the $3$ neurons of hidden layer 1 to the $3$ neurons of hidden layer 2. Each element $w^{(2)}_{ij}$ represents the weight from layer 1 neuron $j \in \{1, 2, 3\}$ to layer 2 neuron $i \in \{1, 2, 3\}$.
*   **Activation Function (Sigmoid):**
    $$
    \mathbf{A}^{[2]} = \sigma(\mathbf{Z}^{[2]}) = \frac{1}{1 + e^{-\mathbf{Z}^{[2]}}}
    $$
    Applies the same sigmoid activation to $\mathbf{Z}^{[2]}$, producing the second hidden layer's activated output $\mathbf{A}^{[2]}$ with shape $(m, 3)$.

**Layer 3 (Hidden to Output):**
*   **Weight Matrix:** $\mathbf{W}^{(3)}$
*   **Dimensions:** $1 \times 3$
*   **Description:** This matrix connects the $3$ neurons of hidden layer 2 to the $1$ output neuron. Each element $w^{(3)}_{j}$ represents the weight from layer 2 neuron $j \in \{1, 2, 3\}$ to the single output neuron.
*   **Output Activation (Sigmoid):**
    $$
    \mathbf{A}^{[last]} = \mathbf{\hat{Y}} = \sigma(\mathbf{Z}^{[3]}) = \frac{1}{1 + e^{-\mathbf{Z}^{[3]}}}
    $$
    Applies the sigmoid function to $\mathbf{Z}^{[3]}$ to produce the final predictions $\mathbf{\hat{Y}}$. Each element $\hat{y}_i$ is a probability between 0 and 1 for sample $i$. $\mathbf{\hat{Y}}$ has shape $(m, 1)$. The sigmoid is applied to each sample independently, so the predictions do not have to sum to 1 across samples (a sum-to-one constraint is a property of softmax over classes, not of sigmoid).

*Where:*
- $m$ is the number of samples (instances)
- $\sigma$ is the activation function for the hidden layers; this tutorial uses the sigmoid (ReLU and tanh are common alternatives)
- $\sigma_{\text{out}}$ is the output layer activation function in general (e.g., linear for regression, sigmoid for binary classification); here it is also the sigmoid
- $^\top$ denotes the transpose operation
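
The dimension flow of the 3-layer network can be checked with a small sketch. The pure-Python example below is only an illustration, not a reference implementation: the input values, the weight values, and the helper names (`matmul_transposed`, `apply_sigmoid`, `shape`) are hypothetical and chosen just to show how the shapes propagate for $m = 4$ samples.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a single number."""
    return 1.0 / (1.0 + math.exp(-z))

def matmul_transposed(a, w):
    """Compute A @ W^T for list-of-lists matrices: (m, k) x (n, k) -> (m, n)."""
    return [[sum(x * y for x, y in zip(row, w_row)) for w_row in w] for row in a]

def apply_sigmoid(z):
    """Apply the sigmoid element-wise to a matrix."""
    return [[sigmoid(v) for v in row] for row in z]

def shape(mat):
    """Return (rows, columns) of a list-of-lists matrix."""
    return (len(mat), len(mat[0]))

# Hypothetical inputs and weights, chosen only to illustrate the shapes.
X = [[0.5, -1.0], [1.5, 2.0], [0.0, 0.3], [-0.7, 0.9]]        # (m, 2), m = 4
W1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]                     # (3, 2)
W2 = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6], [0.7, 0.8, 0.9]]    # (3, 3)
W3 = [[0.2, -0.4, 0.6]]                                       # (1, 3)

A1 = apply_sigmoid(matmul_transposed(X, W1))      # (m, 3)
A2 = apply_sigmoid(matmul_transposed(A1, W2))     # (m, 3)
Y_hat = apply_sigmoid(matmul_transposed(A2, W3))  # (m, 1)

print(shape(A1), shape(A2), shape(Y_hat))
```

Every entry of `Y_hat` lies strictly between 0 and 1 because it is a sigmoid output.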

<table>
<tr>
<td width="70%" valign="top">

# Gradient Descent Process

In the neural network context, we have a **loss function** $L$, which is a function of all the weights $W_1, W_2, \dots$

The gradient of the loss function is the collection of partial derivatives of $L$ with respect to the model parameters, written $\frac{\partial L}{\partial W}$.

What the Gradient Tells Us:
- **Rate of change**: How fast the loss function changes in each weight direction
- **Magnitude of adjustment**: How much each weight should be adjusted
- **Direction of adjustment**: Whether each weight should be increased or decreased

> **Key Points**:
> 1. The gradient points in the direction of the **steepest increase** of the loss function.
> 2. Since our objective is to **minimize** the loss, we adjust the weights in the **opposite direction** of the gradient.
> 3. The gradient is 0 at a minimum of $L$ (a flat slope): $L$ does not change for small changes in $w$.



</td>
<td width="20%" valign="top">

![Network Architecture](gradient.png)
The gradient measures how fast $L$ changes for a change in parameter $w$ (i.e., the slope of the loss surface).


</td>
</tr>
</table>
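
As a minimal sketch of this idea, consider a hypothetical one-parameter loss $L(w) = (w - 3)^2$ (it is not part of the network in this tutorial). Its gradient is $2(w - 3)$, and repeatedly stepping against the gradient moves $w$ toward the minimum at $w = 3$, where the gradient is zero. The starting point, learning rate, and number of steps are assumptions chosen for illustration.

```python
def loss(w):
    """Hypothetical one-parameter loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Analytic derivative dL/dw of the loss above."""
    return 2.0 * (w - 3.0)

w = 0.0               # arbitrary starting point
learning_rate = 0.1   # assumed step size
history = [loss(w)]

for _ in range(50):
    w -= learning_rate * grad(w)   # step in the direction opposite to the gradient
    history.append(loss(w))

print("final w:", w)
print("first and last loss:", history[0], history[-1])
```

The loss shrinks at every step here because the step size is small relative to the curvature of this loss; a much larger learning rate could overshoot the minimum.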

## How is the Gradient Calculated? 

### Partial Derivatives

A **partial derivative** measures the rate of change of a multivariable function with respect to one variable, while holding others constant.

**Example**: Room temperature $T$ depends on:
- AC setting $x_1$
- Outdoor temperature $x_2$

Then:
- $\frac{\partial T}{\partial x_1}$ represents: with the outdoor temperature $x_2$ held fixed, how much does the indoor temperature change per unit increase in the AC setting?
- $\frac{\partial T}{\partial x_2}$ represents: with the AC setting $x_1$ held fixed, how much does the indoor temperature change per unit increase in the outdoor temperature?
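
A partial derivative can be approximated numerically by nudging one input while holding the other fixed. The sketch below uses a made-up temperature model (the function `room_temperature`, its coefficients, and the helper `partial` are hypothetical, chosen only for illustration) together with a central finite difference.

```python
def room_temperature(x1, x2):
    """Hypothetical model: indoor temperature from AC setting x1 and outdoor temperature x2."""
    return 0.6 * x1 + 0.3 * x2 + 0.01 * x1 * x2

def partial(f, args, index, h=1e-6):
    """Central finite-difference estimate of df/d(args[index]); other arguments stay fixed."""
    up = list(args)
    down = list(args)
    up[index] += h
    down[index] -= h
    return (f(*up) - f(*down)) / (2 * h)

point = (24.0, 30.0)   # (AC setting, outdoor temperature), assumed values

dT_dx1 = partial(room_temperature, point, 0)   # analytic value: 0.6 + 0.01 * x2 = 0.9
dT_dx2 = partial(room_temperature, point, 1)   # analytic value: 0.3 + 0.01 * x1 = 0.54

print(dT_dx1, dT_dx2)
```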

### Chain Rule
Neural network computations are nested layer by layer, so we need the **chain rule**.

**Example**: $\hat{y} = f(g(x))$

Then: $\frac{d\hat{y}}{dx} = \frac{d\hat{y}}{dg} \times \frac{dg}{dx}$

In neural networks, we use the chain rule to propagate gradients backward through the network—this is called **backpropagation**.
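
The chain rule can be sanity-checked numerically. In the sketch below, the inner function `g` (a linear pre-activation with made-up coefficients) and the evaluation point are assumptions for illustration; the outer function `f` is the sigmoid. The analytic derivative from the chain rule is compared with a central finite-difference estimate.

```python
import math

def g(x):
    """Inner function (hypothetical): a linear pre-activation."""
    return 3.0 * x + 1.0

def f(u):
    """Outer function: the sigmoid."""
    return 1.0 / (1.0 + math.exp(-u))

def dy_dx(x):
    """Chain rule: dy/dx = f'(g(x)) * g'(x), with f'(u) = f(u) * (1 - f(u)) and g'(x) = 3."""
    y = f(g(x))
    return y * (1.0 - y) * 3.0

x0, h = 0.2, 1e-6
analytic = dy_dx(x0)
numeric = (f(g(x0 + h)) - f(g(x0 - h))) / (2 * h)

print(analytic, numeric)
```

The two printed values should agree to many decimal places; a large gap would indicate a mistake in the hand-derived derivative.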

<table>
<tr>
<td width="60%" valign="top">

## Scenario: House Price Prediction

We aim to predict house prices based on two features:

1. **House Area** (square meters)
2. **House Age** (years)

We create a **2-layer neural network** to process the data and make predictions:
- **Hidden layer**: 3 neurons
- **Output layer**: 1 neuron

### Weight Matrices
$\mathbf{W}^{[1]} = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \\ 0.5 & 0.6 \end{bmatrix}$  
$\mathbf{W}^{[2]} = \begin{bmatrix} 0.7 & 0.8 & 0.9 \end{bmatrix}$

### Bias Vectors
$\mathbf{b}^{[1]} = \begin{bmatrix} 0.1 & 0.2 & 0.3 \end{bmatrix}$  
$\mathbf{b}^{[2]} = \begin{bmatrix} 0.5 \end{bmatrix}$

</td>
<td width="40%" valign="top">

![Network Architecture](gradient2.png)


</td>
</tr>
</table>






## Neural Network Forward Propagation Example

### Step 1: Input Data (Scaled)
**Input**: $\mathbf{X} = \begin{bmatrix} 2.0 & 1.0 \end{bmatrix}$

### Step 2: Hidden Layer Computation

#### 1. Linear Transformation in Hidden Layer
$$
\mathbf{Z}^{[1]} = \mathbf{X} \cdot (\mathbf{W}^{[1]})^\top + \mathbf{b}^{[1]}
$$

Calculation:
$$
\mathbf{Z}^{[1]} = \begin{bmatrix} 2.0 & 1.0 \end{bmatrix} \times 
\begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \\ 0.5 & 0.6 \end{bmatrix}^\top + 
\begin{bmatrix} 0.1 & 0.2 & 0.3 \end{bmatrix} 
= \begin{bmatrix} 0.5 & 1.2 & 1.9 \end{bmatrix}
$$

#### 2. Activation in Hidden Layer
Using sigmoid activation $\sigma$:
$$
\mathbf{A}^{[1]} = \sigma(\mathbf{Z}^{[1]}) = \begin{bmatrix} 0.62 & 0.77 & 0.87 \end{bmatrix}
$$

### Step 3: Output Layer Computation

#### 1. Linear Transformation in Output Layer
$$
\mathbf{Z}^{[2]} = \mathbf{A}^{[1]} \cdot (\mathbf{W}^{[2]})^\top + \mathbf{b}^{[2]}
$$

Calculation:
$$
\mathbf{Z}^{[2]} = \begin{bmatrix} 0.62 & 0.77 & 0.87 \end{bmatrix} \times 
\begin{bmatrix} 0.7 & 0.8 & 0.9 \end{bmatrix}^\top + 
\begin{bmatrix} 0.5 \end{bmatrix} 
= \begin{bmatrix} 1.833 \end{bmatrix} + \begin{bmatrix} 0.5 \end{bmatrix} = \begin{bmatrix} 2.333 \end{bmatrix}
$$

#### 2. Activation in Output Layer
$$
\mathbf{A}^{[2]} = \sigma(\mathbf{Z}^{[2]}) = \begin{bmatrix} 0.91 \end{bmatrix}
$$

**Final Prediction**: $\hat{y} = 0.91$
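
The forward pass can be reproduced with a few lines of pure Python. This is a sketch that uses the inputs, weights, and biases from this example; the variable names are illustrative. The code keeps full floating-point precision, so intermediate values differ slightly from the two-decimal figures in the text.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a single number."""
    return 1.0 / (1.0 + math.exp(-z))

x = [2.0, 1.0]                                 # scaled input: area, age
W1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]      # hidden layer weights, shape (3, 2)
b1 = [0.1, 0.2, 0.3]                           # hidden layer biases
W2 = [0.7, 0.8, 0.9]                           # output layer weights, shape (1, 3)
b2 = 0.5                                       # output layer bias

# Hidden layer: one weighted sum plus bias per neuron, then the sigmoid.
z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
a1 = [sigmoid(z) for z in z1]

# Output layer: the weighted sum of activations is about 1.833; adding the
# bias 0.5 gives z2 of about 2.333.
z2 = sum(w * a for w, a in zip(W2, a1)) + b2
y_hat = sigmoid(z2)

print([round(z, 2) for z in z1])   # hidden pre-activations
print([round(a, 2) for a in a1])   # hidden activations
print(round(z2, 3), round(y_hat, 2))
```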

### Step 4: Loss Function Computation

#### 1. Mean Squared Error (MSE) Loss
$$
L = \frac{1}{2} (y_{\text{pred}} - y_{\text{true}})^2
$$

#### 2. Learning Rate
Learning rate: $\eta = 0.1$


### Step 5: Gradient Descent Training Pipeline

#### 1. Forward Propagation
Input → Hidden Layer Computation → Output Layer Computation → Prediction

#### 2. Loss Calculation
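Compute loss: $L = \frac{1}{2}(\hat{y} - y)^2$. With $\hat{y} = 0.91$ and the target $y = 1.0$ used in the gradient derivation, $L = \frac{1}{2}(0.91 - 1.0)^2 = 0.00405$.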
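
As a small sketch, the loss and its derivative with respect to the prediction can be written as two functions. The function names are illustrative, and the target value $1.0$ is the one used in this tutorial's gradient derivation.

```python
def half_squared_error(y_pred, y_true):
    """Loss L = 1/2 * (y_pred - y_true)^2 for a single sample."""
    return 0.5 * (y_pred - y_true) ** 2

def d_loss_d_pred(y_pred, y_true):
    """Derivative of L with respect to y_pred; the factor 1/2 cancels the exponent 2."""
    return y_pred - y_true

loss = half_squared_error(0.91, 1.0)
slope = d_loss_d_pred(0.91, 1.0)

print(loss, slope)
```

The factor $\frac{1}{2}$ is a convenience: it makes the derivative simply the prediction error.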

#### 3. Backward Propagation
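Calculate the gradient of the loss with respect to each weight in every layer (that is, assign each weight its share of responsibility for the error). The forward computation chain is $z^{[1]} \to a^{[1]} \to z^{[2]} \to y_{\text{pred}} = a^{[2]} \to L$; backpropagation applies the chain rule along this chain in reverse, from $L$ back to $z^{[1]}$.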

### 3.1 Compute Output Layer Gradients

$$
y_{\text{pred}} = \sigma(z^{[2]})
$$

#### Chain Rule Application
$$
\frac{\partial L}{\partial z^{[2]}} = \frac{\partial L}{\partial y_{\text{pred}}} \times \frac{\partial y_{\text{pred}}}{\partial z^{[2]}}
$$

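#### Step 1: Derivative of the Loss with Respect to the Prediction
With $L = \frac{1}{2}(y_{\text{pred}} - y_{\text{true}})^2$ and the target $y_{\text{true}} = 1.0$:
$$
\frac{\partial L}{\partial y_{\text{pred}}} = -(y_{\text{true}} - y_{\text{pred}}) = -(1.0 - 0.91) = -0.09
$$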

#### Step 2: Derivative of Sigmoid
$$
\frac{\partial y_{\text{pred}}}{\partial z^{[2]}} = y_{\text{pred}} \times (1 - y_{\text{pred}})
$$
$$
= 0.91 \times (1 - 0.91) = 0.91 \times 0.09 = 0.0819
$$

#### Step 3: Output Layer Error Signal
**$\frac{\partial L}{\partial z^{[2]}}$ (Output layer error signal)**
$$
\frac{\partial L}{\partial z^{[2]}} = (-0.09) \times 0.0819 = -0.007371
$$

#### Step 4: Gradient of Output Layer Weights
$$
z^{[2]} = \mathbf{W}^{[2]} \cdot \mathbf{a}^{[1]} + b^{[2]}
$$
$$
\frac{\partial L}{\partial \mathbf{W}^{[2]}} = \frac{\partial L}{\partial z^{[2]}} \times \frac{\partial z^{[2]}}{\partial \mathbf{W}^{[2]}} = -0.007371 \times \mathbf{a}^{[1]}
$$
$$
= -0.007371 \times [0.62, 0.77, 0.87] = [-0.004570, -0.005676, -0.006413]
$$

#### Summary:
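- **First weight $w_{21}$**: gradient $-0.00457$; a small increase in this weight lowers the loss at a rate of about $0.00457$ per unit of weight
- **Second weight $w_{22}$**: gradient $-0.00568$; a small increase lowers the loss at a rate of about $0.00568$ per unit of weight
- **Third weight $w_{23}$**: gradient $-0.00641$; a small increase lowers the loss at a rate of about $0.00641$ per unit of weight

These rates are local (first-order) estimates, valid only for small changes. Since all three gradients are negative, a gradient descent step increases each of these weights.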
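
The output layer computation can be reproduced with the sketch below. It deliberately starts from the rounded values used in the hand calculation ($y_{\text{pred}} = 0.91$, $\mathbf{a}^{[1]} = [0.62, 0.77, 0.87]$), so the results match the text; with full-precision activations the error signal is closer to $-0.0071$. Variable names are illustrative.

```python
y_pred, y_true = 0.91, 1.0        # rounded prediction and target from the text
a1 = [0.62, 0.77, 0.87]           # rounded hidden activations from the text

dL_dy = y_pred - y_true           # derivative of the loss w.r.t. the prediction: -0.09
dy_dz2 = y_pred * (1.0 - y_pred)  # sigmoid derivative at the output: 0.0819
delta2 = dL_dy * dy_dz2           # output layer error signal: about -0.007371

# dz2/dW2 is the hidden activation vector, so each weight gradient is delta2 * a1[j].
grad_W2 = [delta2 * a for a in a1]

print(round(delta2, 6))
print([round(g, 6) for g in grad_W2])
```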




### 3.2 Compute Hidden Layer Gradients

#### Network Equations
$$
\mathbf{z}^{[1]} = \mathbf{W}^{[1]} \cdot \mathbf{X} + \mathbf{b}^{[1]}
$$
$$
\mathbf{a}^{[1]} = \sigma(\mathbf{z}^{[1]})
$$

#### Chain Rule Application
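**$\frac{\partial L}{\partial \mathbf{W}^{[1]}}$ (Hidden layer weight gradient; the first three factors together form the hidden layer error signal $\frac{\partial L}{\partial \mathbf{z}^{[1]}}$)**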
$$
\frac{\partial L}{\partial \mathbf{W}^{[1]}} = \frac{\partial L}{\partial z^{[2]}} \times \frac{\partial z^{[2]}}{\partial \mathbf{a}^{[1]}} \times \frac{\partial \mathbf{a}^{[1]}}{\partial \mathbf{z}^{[1]}} \times \frac{\partial \mathbf{z}^{[1]}}{\partial \mathbf{W}^{[1]}}
$$

#### Step-by-Step Computation
1. **Output layer error**: $\frac{\partial L}{\partial z^{[2]}} = -0.007371$
2. **Gradient through weights**: $\frac{\partial z^{[2]}}{\partial \mathbf{a}^{[1]}} = \mathbf{W}^{[2]}$
3. **Sigmoid derivative**: $\frac{\partial \mathbf{a}^{[1]}}{\partial \mathbf{z}^{[1]}} = \sigma'(\mathbf{z}^{[1]}) = \mathbf{a}^{[1]} \odot (1 - \mathbf{a}^{[1]})$
4. **Input contribution**: $\frac{\partial \mathbf{z}^{[1]}}{\partial \mathbf{W}^{[1]}} = \mathbf{X}$

#### Calculation
$$
\frac{\partial L}{\partial \mathbf{W}^{[1]}} = -0.007371 \times \mathbf{W}^{[2]^\top} \odot \sigma'(\mathbf{z}^{[1]}) \times \mathbf{X}
$$
Where:
- $\mathbf{W}^{[2]^\top} = [0.7, 0.8, 0.9]^\top$
- $\sigma'(\mathbf{z}^{[1]}) = [0.2356, 0.1771, 0.1131]^\top$
- $\mathbf{X} = [2.0, 1.0]$

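#### Result
The product $(3 \times 1) \times (1 \times 2)$ gives a gradient with the same shape as $\mathbf{W}^{[1]}$ ($3 \times 2$): row $i$ holds the gradients of the two weights feeding hidden neuron $i$.
$$
\frac{\partial L}{\partial \mathbf{W}^{[1]}} = 
\begin{bmatrix}
-0.002431 & -0.001216 \\
-0.002089 & -0.001044 \\
-0.001501 & -0.000750
\end{bmatrix}
$$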
#### Interpretation:
Each element represents how much the loss changes with respect to a small change in the corresponding hidden layer weight.
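
The hidden layer gradients can be reproduced with the sketch below, which again starts from the rounded values of the hand calculation. The gradient is arranged with the same $(3, 2)$ shape as $\mathbf{W}^{[1]}$; its transpose is the equivalent $(2, 3)$ layout. Last digits can differ slightly from a hand calculation depending on where intermediate values are rounded. Variable names are illustrative.

```python
delta2 = -0.007371            # output layer error signal from the text
W2 = [0.7, 0.8, 0.9]          # output layer weights
a1 = [0.62, 0.77, 0.87]       # rounded hidden activations from the text
x = [2.0, 1.0]                # input

# Sigmoid derivative expressed through the activations: a * (1 - a).
sigma_prime = [a * (1.0 - a) for a in a1]

# Hidden layer error signal: propagate delta2 back through W2, then through the sigmoid.
delta1 = [delta2 * w * s for w, s in zip(W2, sigma_prime)]

# Outer product of delta1 and x: entry (i, j) is dL/dW1[i][j].
grad_W1 = [[d * xi for xi in x] for d in delta1]

print([round(s, 4) for s in sigma_prime])
for row in grad_W1:
    print([round(g, 6) for g in row])
```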

### 4. Weight Update
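New weight = Old weight − Learning rate × Gradient, that is, $\mathbf{W} \leftarrow \mathbf{W} - \eta \frac{\partial L}{\partial \mathbf{W}}$. For example, with $\eta = 0.1$ the first output layer weight becomes $0.7 - 0.1 \times (-0.004570) = 0.700457$.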

### 5. Iterate Until Optimal
Repeat steps 1-4 until the loss converges (stops decreasing meaningfully) or a maximum number of iterations is reached.
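
Putting the steps together, the whole pipeline for the single training example can be sketched as follows. This is an illustrative sketch rather than a definitive implementation: the function name `train_step`, the choice of 200 iterations, and the decision to update the biases alongside the weights (using $\frac{\partial L}{\partial b} = \delta$ for each layer) are assumptions not spelled out in the tutorial. The code works at full precision, so its numbers differ slightly from the rounded hand calculation.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a single number."""
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, y_true, W1, b1, W2, b2, lr):
    """One forward pass, backward pass, and in-place update; returns the loss before the update."""
    # 1. Forward propagation
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [sigmoid(z) for z in z1]
    y_pred = sigmoid(sum(w * a for w, a in zip(W2, a1)) + b2[0])

    # 2. Loss calculation
    loss = 0.5 * (y_pred - y_true) ** 2

    # 3. Backward propagation (both error signals use the weights before the update)
    delta2 = (y_pred - y_true) * y_pred * (1.0 - y_pred)
    delta1 = [delta2 * w * a * (1.0 - a) for w, a in zip(W2, a1)]

    # 4. Weight update: new = old - learning rate * gradient
    for j in range(len(W2)):
        W2[j] -= lr * delta2 * a1[j]
    b2[0] -= lr * delta2
    for i in range(len(W1)):
        for j in range(len(x)):
            W1[i][j] -= lr * delta1[i] * x[j]
        b1[i] -= lr * delta1[i]
    return loss

W1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
b1 = [0.1, 0.2, 0.3]
W2 = [0.7, 0.8, 0.9]
b2 = [0.5]

# 5. Iterate: repeat the step on the single example (x = [2.0, 1.0], target 1.0).
losses = [train_step([2.0, 1.0], 1.0, W1, b1, W2, b2, lr=0.1) for _ in range(200)]

print("first loss:", losses[0])
print("last loss:", losses[-1])
```

With a single training example the loss keeps shrinking as the prediction approaches the target; in practice the loop would run over many samples and stop based on a convergence check or an iteration limit.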