```{contents}
```

# Multi-Layer Neural Networks

The discussion moves from **single-layer perceptrons** to **multi-layer perceptrons (MLPs)**—also known as **artificial neural networks (ANNs)**—to overcome limitations of basic models.

Single-layer perceptrons:

* Only support **feedforward propagation**
* Can solve only **linearly separable problems**
* Have no efficient way to update weights

To address this, we move to **multi-layer neural networks** that use:

* ✅ **Forward Propagation**
* ✅ **Backward Propagation** (popularized by Rumelhart, Hinton, and Williams in 1986)
* ✅ **Loss Functions**
* ✅ **Optimizers**
* ✅ **Activation Functions**

---

## Architecture of a Multi-Layer Neural Network

Example dataset:

* Inputs/features: IQ, Study Hours, Play Hours → **X₁, X₂, X₃**
* Output: Pass/Fail → **y (1 or 0)**

Network structure:

* **Input Layer** → 3 neurons
* **Hidden Layer 1** → 1 neuron
* **Output Layer** → 1 neuron
  (This example shows a simple 2-layer ANN, but real networks can have many layers and neurons.)

Each layer includes:

* Weights (W)
* Biases (b)
* Neurons

![alt text](../images/mnn.png)

---

## Forward Propagation (Feedforward)

Forward propagation happens in two main steps **for each neuron**:

###  Step 1: Weighted Sum

$$
z = \sum (x_i \cdot w_i) + b
$$

Example:

* $z = 95 \times 0.01 + 4 \times 0.02 + 4 \times 0.03 + b_1 = 1.15 + b_1 \approx 1.151$ (which implies a small bias, $b_1 \approx 0.001$)

### Step 2: Activation Function

Using **sigmoid**:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

* Squashes any real value into the range **0 to 1**
* Example:
  $$
  \sigma(1.151) \approx 0.760
  $$

This output is passed forward to the next layer, and the process repeats until the final output (**ŷ**) is produced.

---

## Loss Function (Error Calculation)

After forward propagation:

* Predicted output: **ŷ = 0.511**
* Actual output: **y = 1**

Error (Loss):
$$
\text{Loss} = (y - \hat{y})^2 = (1 - 0.511)^2 \approx 0.239
$$

---

## Backward Propagation (Backprop)

Goal:

* Reduce the loss by updating weights
* Starts from output layer and moves **backwards**

Weights updated in reverse order:

1. Output layer weights (e.g., W₄)
2. Hidden layer weights (e.g., W₁, W₂, W₃)

To do this efficiently, we use:

* **Optimizers** (e.g., Gradient Descent)
* **Loss functions**
* **Activation function derivatives**

---

## Loss Function vs Cost Function

* **Loss Function**: Error for **each individual data point**
* **Cost Function**: Average (or total) loss over **all data points**
  $$
  Cost = \frac{1}{n} \sum (y - \hat{y})^2
  $$
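
The distinction can be made concrete with a short sketch. The helper names and the four label/prediction pairs below are hypothetical; only the first pair (y = 1, ŷ = 0.511) comes from the earlier example.

```python
def squared_error(y, y_hat):
    """Loss: error for a single data point."""
    return (y - y_hat) ** 2

def mse_cost(ys, y_hats):
    """Cost: mean of the per-point losses over the whole dataset."""
    losses = [squared_error(y, y_hat) for y, y_hat in zip(ys, y_hats)]
    return sum(losses) / len(losses)

# Hypothetical labels and predictions for four students.
ys = [1, 0, 1, 1]
y_hats = [0.511, 0.2, 0.9, 0.6]

single_loss = squared_error(ys[0], y_hats[0])
cost = mse_cost(ys, y_hats)
print(f"loss for the first record = {single_loss:.3f}")
print(f"cost over all records     = {cost:.3f}")
```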

---

## Activations & Optimizers

* **Sigmoid** used for binary outputs
* Other activations (ReLU, Tanh, etc.) will be covered later
* Optimizers (like SGD, Adam) adjust weights during backprop

---

##  The Training Loop

For each record:

1. Forward propagation → compute ŷ
2. Calculate loss
3. Backward propagation → update weights
4. Repeat until loss is minimized
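
The loop above might look roughly like this for the simplest possible model, a single sigmoid neuron trained with per-record gradient descent on the squared error. Everything specific here is an assumption for illustration: the scaled records, the zero initial weights, the learning rate, and the fixed number of epochs used in place of a "loss is minimized" stopping test.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, weights, bias):
    """Forward propagation for one sigmoid neuron."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def cost(records, weights, bias):
    """Mean squared error over all records."""
    return sum((y - predict(x, weights, bias)) ** 2 for x, y in records) / len(records)

def train(records, learning_rate=0.5, epochs=1000):
    weights, bias = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):                      # 4. repeat
        for x, y in records:
            y_hat = predict(x, weights, bias)    # 1. forward propagation
            loss = (y - y_hat) ** 2              # 2. loss (computed only for illustration)
            # 3. backward propagation: dLoss/dz via the chain rule
            dz = -2 * (y - y_hat) * y_hat * (1 - y_hat)
            weights = [w - learning_rate * dz * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * dz
    return weights, bias

# Hypothetical, already-scaled records: ([iq, study, play], passed)
records = [
    ([0.9, 0.8, 0.1], 1),
    ([0.8, 0.6, 0.3], 1),
    ([0.5, 0.2, 0.8], 0),
    ([0.4, 0.1, 0.9], 0),
]

initial_cost = cost(records, [0.0, 0.0, 0.0], 0.0)
weights, bias = train(records)
final_cost = cost(records, weights, bias)
print(f"cost before training: {initial_cost:.3f}, after: {final_cost:.3f}")
```

A real training loop would usually also shuffle the records and monitor the cost on held-out data, which this sketch leaves out.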

---


## Neuron as a Function

At the core of any neural network is a **neuron**, which is essentially a mathematical function.

For a neuron, the **input-output relationship** is:

$$
y = f(z) = f\Big(\sum_{i=1}^{n} w_i x_i + b \Big)
$$

Where:

* $x_1, x_2, ..., x_n$ are the inputs.
* $w_1, w_2, ..., w_n$ are the **weights** associated with each input.
* $b$ is the **bias**.
* $z = \sum w_i x_i + b$ is called the **logit** or **pre-activation value**.
* $f$ is the **activation function** (e.g., Sigmoid, ReLU, Tanh), introducing **nonlinearity**.
* $y$ is the output of the neuron.

**Intuition:**
Weights determine how important each input is, bias allows shifting the activation, and the activation function lets the neuron “decide” whether to activate or not.
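
Treating the neuron literally as a function makes the role of $f$ easy to see. In this sketch the function name `neuron` and the input, weight, and bias values are hypothetical; the same pre-activation $z$ is passed through three different activations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def neuron(x, w, b, f):
    """y = f(sum_i w_i * x_i + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(z)

# Hypothetical inputs, weights, and bias; here z works out to about -0.2.
x = [1.0, -2.0, 0.5]
w = [0.4, 0.3, -0.2]
b = 0.1

y_sigmoid = neuron(x, w, b, sigmoid)    # value in (0, 1)
y_tanh = neuron(x, w, b, math.tanh)     # value in (-1, 1)
y_relu = neuron(x, w, b, relu)          # 0 for negative z, z otherwise
print(y_sigmoid, y_tanh, y_relu)
```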

---

## Single Layer vs Multi-Layer

* **Single layer network**: Only one layer of neurons between input and output.
  $$
  \hat{y} = f(W^T x + b)
  $$

  * Limitation: Can only model **linear relationships** (that is, linearly separable problems).

* **Multi-layer network**: Multiple layers of neurons allow **composing functions**, enabling modeling of **nonlinear relationships**.

Mathematically, a **2-layer network** (1 hidden layer) is:

$$
\begin{aligned}
h &= f_1(W_1^T x + b_1) \quad \text{(hidden layer output)} \\
\hat{y} &= f_2(W_2^T h + b_2) \quad \text{(final output)}
\end{aligned}
$$

Here:

* $x \in \mathbb{R}^n$ is input
* $h \in \mathbb{R}^m$ is hidden layer activation
* $\hat{y}$ is the predicted output

**Intuition:** Each layer applies a **nonlinear transformation** to the input from the previous layer, building increasingly complex representations.
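
To see why the nonlinearity is doing the work, the sketch below stacks two layers with no activation (identity $f_1$ and $f_2$) and checks that they equal one linear layer. All parameter values are hypothetical, and each row of a matrix holds one neuron's weights, so `matvec(W, x)` plays the role of $W^T x$ in the notation above.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix product of A and B, both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

# Hypothetical parameters: 3 inputs -> 2 hidden units -> 1 output.
W1 = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2]]
b1 = [0.1, -0.3]
W2 = [[0.7, -0.6]]
b2 = [0.05]
x = [1.0, 2.0, 3.0]

# Two layers with no activation function.
h = add(matvec(W1, x), b1)
two_layer = add(matvec(W2, h), b2)

# One equivalent linear layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layer, one_layer)  # the two results agree up to rounding
```

Inserting a sigmoid (or any other nonlinear $f_1$) between the two layers breaks this equivalence, which is what lets the extra layer add modeling power.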

---

## Forward Propagation

Forward propagation is just **computing outputs layer by layer**:

1. Multiply inputs by weights, add bias → $z = W^T x + b$
2. Apply activation → $a = f(z)$
3. Pass $a$ as input to the next layer

For $L$ layers:

$$
\begin{aligned}
a^1 &= f^1(W^1 x + b^1) \\
a^2 &= f^2(W^2 a^1 + b^2) \\
&\vdots \\
\hat{y} &= f^L(W^L a^{L-1} + b^L)
\end{aligned}
$$

**Intuition:** The network transforms raw input into a hierarchical representation, layer by layer.
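
The layer-by-layer recursion can be written as a short loop. This is a sketch under assumptions: each layer is represented as a hypothetical `(W, b, f)` triple, each row of `W` holds one neuron's weights, and the parameter values are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, a_prev, f):
    """One layer: z = W a_prev + b, then a = f(z) element-wise."""
    z = [sum(w * a for w, a in zip(row, a_prev)) + bi for row, bi in zip(W, b)]
    return [f(zi) for zi in z]

def forward(layers, x):
    """Feed x through every (W, b, f) triple in order and return y_hat."""
    a = x
    for W, b, f in layers:
        a = layer_forward(W, b, a, f)
    return a

# Hypothetical 3 -> 2 -> 1 network with sigmoid activations.
layers = [
    ([[0.1, 0.2, -0.1], [-0.3, 0.1, 0.2]], [0.0, 0.1], sigmoid),
    ([[0.5, -0.4]], [0.2], sigmoid),
]
y_hat = forward(layers, [1.0, 0.5, -1.0])
print(y_hat)
```

Adding a layer only means appending another triple to `layers`; the loop itself does not change.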

---

##  Loss Function and Training

To train the network, we define a **loss function**, e.g., Mean Squared Error (MSE) for regression:

$$
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2
$$

* $\hat{y}_i$ = predicted output
* $y_i$ = true output

**Goal:** Minimize the loss by adjusting weights and biases.

---

## Backpropagation (Intuition)

**Backpropagation** is a way to compute **how much each weight contributed to the error** using the **chain rule of calculus**:

$$
\frac{\partial \mathcal{L}}{\partial W^l} = \frac{\partial \mathcal{L}}{\partial a^l} \cdot \frac{\partial a^l}{\partial z^l} \cdot \frac{\partial z^l}{\partial W^l}
$$

**Intuition:**

* Start from the output, compute the error
* Propagate the error backward through each layer
* Update weights in the direction that reduces loss

Weight update formula (gradient descent):

$$
W^l := W^l - \eta \frac{\partial \mathcal{L}}{\partial W^l}
$$

Where $\eta$ is the **learning rate**.
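
A quick way to build trust in the chain-rule product is to compare it with a numerical derivative. The sketch below does this for a single sigmoid neuron with one weight and a squared-error loss; the numeric values and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron."""
    return (sigmoid(w * x + b) - y) ** 2

def grad_w(w, b, x, y):
    """dL/dw as the product of the three chain-rule factors."""
    a = sigmoid(w * x + b)
    dL_da = 2 * (a - y)      # dL/da
    da_dz = a * (1 - a)      # da/dz for the sigmoid
    dz_dw = x                # dz/dw
    return dL_da * da_dz * dz_dw

# Hypothetical values for the weight, bias, input, label, and learning rate.
w, b, x, y, eta = 0.5, 0.1, 2.0, 1.0, 0.1

analytic = grad_w(w, b, x, y)
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

w_new = w - eta * analytic   # gradient-descent update
print(f"analytic = {analytic:.6f}, numeric = {numeric:.6f}")
print(f"loss before = {loss(w, b, x, y):.4f}, after = {loss(w_new, b, x, y):.4f}")
```

The gradient is negative here, so the update increases `w` and the loss drops.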

---

**Key Takeaways**

1. Multi-layer networks **compose functions**: Each layer transforms the representation of input data.
2. **Nonlinearity is essential**: Without activation functions, multiple layers collapse to a single linear transformation.
3. Forward propagation = compute predictions; backward propagation = compute gradients and update weights.
4. Hidden layers automatically **learn hierarchical features**, reducing the need for manual feature engineering.
5. Multi-layer networks place one or more **hidden layers** between input and output.
6. Forward and backward propagation together make learning possible.
7. **Sigmoid** activation maps any value into the range (0, 1).
8. The loss measures how wrong the model's prediction is.
9. Optimizers reduce the loss by adjusting weights and biases.
10. Backpropagation makes this learning efficient by computing all gradients in one backward pass.
11. Together, these ideas form the foundation of deep neural networks.

## **Problem Statement: Predicting Student Exam Success**

Imagine we want to predict whether a student will **pass or fail an exam** based on the following features:

| Feature                | Symbol | Example |
| ---------------------- | ------ | ------- |
| IQ                     | $x_1$  | 95      |
| Hours of Study per day | $x_2$  | 4       |
| Hours of Play per day  | $x_3$  | 2       |

The **output** is:

$$
y =
\begin{cases}
1 & \text{if student passes} \\
0 & \text{if student fails}
\end{cases}
$$

---

### **Step 1: Representing as a Neural Network**

We can design a **simple multi-layer neural network**:

* **Input Layer:** 3 neurons (IQ, study hours, play hours)
* **Hidden Layer 1:** 2 neurons
* **Output Layer:** 1 neuron (pass/fail)

Each neuron has:

$$
z = \sum w_i x_i + b
$$
$$
a = f(z) \quad (\text{activation function})
$$

* **Weights ($w$)** determine the importance of each feature.
* **Bias ($b$)** shifts the activation threshold.
* **Activation function (sigmoid)** squashes the output between 0 and 1, representing a probability.

---

### **Step 2: Forward Propagation**

**1. Hidden Layer 1**

For neuron 1 in hidden layer 1:

$$
z_1^{(1)} = w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1
$$
$$
a_1^{(1)} = f(z_1^{(1)}) \quad \text{(sigmoid)}
$$

For neuron 2:

$$
z_2^{(1)} = w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2
$$
$$
a_2^{(1)} = f(z_2^{(1)})
$$

**Intuition:**
The hidden layer transforms the raw features (IQ, study hours, play hours) into an **intermediate representation** that captures nonlinear combinations of the inputs.

---

**2. Output Layer**

The output neuron takes inputs from the hidden layer:

$$
z^{(2)} = w_{31} a_1^{(1)} + w_{32} a_2^{(1)} + b_3
$$
$$
\hat{y} = f(z^{(2)}) \quad \text{(sigmoid, probability of passing)}
$$

**Intuition:**
The output layer combines the learned features from hidden layers to make the final prediction.
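
Putting both layers together, a forward pass for this 3-2-1 network could be sketched as follows. The inputs come from the feature table above; every weight and bias value is hypothetical and chosen only so the numbers stay in a readable range.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Inputs from the feature table: IQ, study hours, play hours.
x1, x2, x3 = 95, 4, 2

# Hypothetical hidden-layer parameters.
w11, w12, w13, b1 = 0.01, 0.02, 0.03, 0.001
w21, w22, w23, b2 = -0.01, 0.05, -0.04, 0.0
# Hypothetical output-layer parameters.
w31, w32, b3 = 0.6, -0.4, 0.1

# Hidden layer 1
z1 = w11 * x1 + w12 * x2 + w13 * x3 + b1
a1 = sigmoid(z1)
z2 = w21 * x1 + w22 * x2 + w23 * x3 + b2
a2 = sigmoid(z2)

# Output layer
z_out = w31 * a1 + w32 * a2 + b3
y_hat = sigmoid(z_out)

prediction = 1 if y_hat >= 0.5 else 0   # threshold the probability
print(f"a1 = {a1:.3f}, a2 = {a2:.3f}, y_hat = {y_hat:.3f}, prediction = {prediction}")
```

With these made-up parameters the predicted probability of passing comes out at roughly 0.6.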

---

### **Step 3: Calculating Loss**

We compare the predicted output ($\hat{y}$) with the true label ($y$) using a **loss function**, e.g., the **squared error** (averaged over all examples, this becomes the Mean Squared Error):

$$
\mathcal{L} = (\hat{y} - y)^2
$$

**Example:**

* Predicted probability: $\hat{y} = 0.6$
* True output: $y = 1$
* Loss: $(0.6 - 1)^2 = 0.16$

**Intuition:**
This tells us **how wrong the prediction is**.

---

### **Step 4: Backward Propagation (Weight Updates)**

To reduce the loss, we adjust the weights **proportionally to their contribution to the error**:

$$
w := w - \eta \frac{\partial \mathcal{L}}{\partial w}
$$

* $\eta$ = learning rate
* $\frac{\partial \mathcal{L}}{\partial w}$ = gradient of the loss with respect to the weight

**Intuition:**

* Weights that contributed more to the error (larger gradients) receive larger updates.
* This process repeats over many examples until the network predicts accurately.
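
One full update for the 3-2-1 network might be sketched like this. It assumes sigmoid activations in both layers and the squared-error loss from Step 3; the starting parameters, the learning rate, and the function names `forward` and `train_step` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, w2, b2):
    """Return hidden activations and prediction for a 3-2-1 network."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y_hat = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
    return h, y_hat

def train_step(x, y, W1, b1, w2, b2, eta):
    """One forward pass, one backward pass, one gradient-descent update."""
    h, y_hat = forward(x, W1, b1, w2, b2)
    # Output layer: dL/dz_out = dL/dy_hat * dy_hat/dz_out
    delta_out = 2 * (y_hat - y) * y_hat * (1 - y_hat)
    # Hidden layer: push delta_out back through w2 and the sigmoid
    delta_hid = [delta_out * w * hi * (1 - hi) for w, hi in zip(w2, h)]
    # w := w - eta * dL/dw for every weight and bias
    new_w2 = [w - eta * delta_out * hi for w, hi in zip(w2, h)]
    new_b2 = b2 - eta * delta_out
    new_W1 = [[w - eta * d * xi for w, xi in zip(row, x)] for row, d in zip(W1, delta_hid)]
    new_b1 = [b - eta * d for b, d in zip(b1, delta_hid)]
    return new_W1, new_b1, new_w2, new_b2

# Hypothetical starting parameters and learning rate.
x, y = [95, 4, 2], 1
W1 = [[0.01, 0.02, 0.03], [-0.01, 0.05, -0.04]]
b1 = [0.001, 0.0]
w2 = [0.6, -0.4]
b2 = 0.1
eta = 0.01

_, y_before = forward(x, W1, b1, w2, b2)
W1, b1, w2, b2 = train_step(x, y, W1, b1, w2, b2, eta)
_, y_after = forward(x, W1, b1, w2, b2)
loss_before = (y_before - y) ** 2
loss_after = (y_after - y) ** 2
print(f"loss before = {loss_before:.4f}, after one step = {loss_after:.4f}")
```

Because IQ is left unscaled here, its weights move much more than the others in a single step; scaling the features first is a common way to even this out.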

---

### **Step 5: Key Takeaways from This Example**

1. **Hidden Layers Learn Features Automatically**

   * The network learns how to combine IQ, study, and play hours into features that help predict pass/fail.

2. **Nonlinearity Matters**

   * Without activation functions like sigmoid, the network would **only be able to learn linear rules** (e.g., a simple weighted sum).

3. **Forward + Backward Propagation**

   * Forward: compute predictions
   * Backward: update weights based on error

4. **Output as Probability**

   * The network can output a probability (0-1) of passing, which can be thresholded to make a binary prediction.

![alt text](../images/bp_wu.png)

## Chain Rule of Derivatives in Neural Networks

This section explains the **chain rule of derivatives** and its importance in **weight updating** during backpropagation.

---

### 1. Weight Update Formula Recap

The generic weight update rule is:

$$
w_{\text{new}} = w_{\text{old}} - \text{learning rate} \times \frac{\partial \text{Loss}}{\partial w_{\text{old}}}
$$

The key challenge is calculating the derivative $\frac{\partial \text{Loss}}{\partial w_{\text{old}}}$ efficiently when a weight influences multiple neurons across layers.

---

### 2. Applying the Chain Rule

The chain rule helps decompose the derivative of the loss with respect to a weight into a product of simpler derivatives across layers:

$$
\frac{\partial \text{Loss}}{\partial w} = \frac{\partial \text{Loss}}{\partial O_{\text{next}}} \cdot \frac{\partial O_{\text{next}}}{\partial O_{\text{current}}} \cdot \frac{\partial O_{\text{current}}}{\partial w}
$$

* For a hidden neuron affecting multiple outputs, the chain rule ensures all paths are accounted for.
* This allows the calculation of gradients even for weights deep in the network.

---

### 3. Example

* Consider a network with one input layer, one hidden layer, and one output layer.
* To update a weight $w_4$, the derivative is computed as:

$$
\frac{\partial \text{Loss}}{\partial w_4} = \frac{\partial \text{Loss}}{\partial O_2} \cdot \frac{\partial O_2}{\partial w_4}
$$

* For a weight in the first layer ($w_1$) influencing multiple downstream neurons, the derivative sums contributions through all paths:

$$
\frac{\partial \text{Loss}}{\partial w_1} = \frac{\partial \text{Loss}}{\partial O_2} \cdot \frac{\partial O_2}{\partial O_1} \cdot \frac{\partial O_1}{\partial w_1} + \dots
$$

This illustrates how **backpropagation relies on the chain rule** to propagate errors backward through multiple layers.
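
The two derivatives above can be expanded into explicit formulas once the activation and loss are fixed. The following is a worked sketch under assumptions that this section does not spell out: both neurons use the sigmoid, the loss is the squared error $(y - O_2)^2$, $w_1$ multiplies input $x_1$, and $w_4$ connects the hidden output $O_1$ to the output neuron, so that $O_2 = \sigma(w_4 O_1 + b_2)$ with $b_2$ a hypothetical name for the output bias. The individual factors are:

$$
\begin{aligned}
\frac{\partial \text{Loss}}{\partial O_2} &= -2\,(y - O_2) \\
\frac{\partial O_2}{\partial w_4} &= O_2 (1 - O_2)\, O_1 \\
\frac{\partial O_2}{\partial O_1} &= O_2 (1 - O_2)\, w_4 \\
\frac{\partial O_1}{\partial w_1} &= O_1 (1 - O_1)\, x_1
\end{aligned}
$$

Multiplying them along each path gives:

$$
\begin{aligned}
\frac{\partial \text{Loss}}{\partial w_4} &= -2\,(y - O_2)\; O_2 (1 - O_2)\; O_1 \\
\frac{\partial \text{Loss}}{\partial w_1} &= -2\,(y - O_2)\; O_2 (1 - O_2)\, w_4\; O_1 (1 - O_1)\, x_1
\end{aligned}
$$

With a single hidden neuron there is only one path from $w_1$ to the loss, so the sum has a single term; with more hidden or output neurons, one such product is added per path.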

![](../images/cod.png)

---

**Key Takeaways**

* Chain rule allows calculating gradients for weights in **deep networks**.
* Gradients account for all paths from a weight to the output.
* Each gradient is scaled by the learning rate and subtracted from its weight to perform the update.
* This process enables the network to **learn efficiently** via backpropagation.