### 🧠 Problem: Learn XOR-ish Behavior

Suppose you want a neural network to learn a single training example, one row of the XOR truth table:

```
Input x = [1, 0]
Output y = 1
```


### 🔧 Neural Network Architecture

* **Input layer**: 2 features
* **Hidden layer**: 1 neuron (with sigmoid activation)
* **Output layer**: 1 neuron (with sigmoid activation)


### 📐 Initialize Parameters

Let’s assume:

* **Input**:
  $x = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$

* **Target output**:
  $y = 1$

* **Weights and biases**:

  * $W_1 = \begin{bmatrix} 0.5 & -0.5 \end{bmatrix}$ (1×2)
  * $b_1 = 0$
  * $W_2 = \begin{bmatrix} 1 \end{bmatrix}$ (1×1)
  * $b_2 = 0$


### ✅ Step 1: Forward Pass

#### Hidden Layer

$$
z_1 = W_1 \cdot x + b_1 = (0.5)(1) + (-0.5)(0) + 0 = 0.5
$$

Apply **sigmoid** activation:

$$
a_1 = \sigma(z_1) = \frac{1}{1 + e^{-0.5}} \approx 0.622
$$

#### Output Layer

$$
z_2 = W_2 \cdot a_1 + b_2 = (1)(0.622) + 0 = 0.622
$$

Again apply sigmoid:

$$
\hat{y} = \sigma(z_2) = \frac{1}{1 + e^{-0.622}} \approx 0.651
$$

So the **predicted output** is about **0.65**, which is the rounded value used in the steps that follow.
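The forward pass above can be reproduced in a few lines. This is a minimal sketch using only the standard library; the `sigmoid` helper and the variable names are illustrative choices, not part of the original notes.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Values from the "Initialize Parameters" section
x = [1.0, 0.0]
W1 = [0.5, -0.5]   # 1x2 weight row for the single hidden neuron
b1 = 0.0
W2 = 1.0           # 1x1 weight
b2 = 0.0

# Hidden layer: weighted sum, then sigmoid
z1 = W1[0] * x[0] + W1[1] * x[1] + b1
a1 = sigmoid(z1)

# Output layer: weighted sum, then sigmoid
z2 = W2 * a1 + b2
y_hat = sigmoid(z2)

print(f"z1={z1:.3f}  a1={a1:.3f}  z2={z2:.3f}  y_hat={y_hat:.3f}")
```

Because the code keeps full floating-point precision, its numbers can differ from the hand calculation in the third decimal place.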


### 🎯 Step 2: Compute Loss (Binary Cross Entropy)

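$$
\text{Loss} = -[y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})]
= -\log(0.65) \approx 0.431
$$

Here $\log$ is the natural logarithm. Because $y = 1$, the second term vanishes and only $-\log(\hat{y})$ remains.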
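As a quick check of the loss value, here is a small sketch of binary cross-entropy for a single example. The function name `bce` is an assumption made for illustration, and the rounded prediction 0.65 is taken from the forward pass.

```python
import math

def bce(y, y_hat):
    """Binary cross-entropy for one example, using the natural log."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

loss = bce(1, 0.65)   # y = 1, so only the -log(y_hat) term contributes
print(round(loss, 3))
```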


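### 🔁 Step 3: Backpropagation (Computing Gradients)

Backpropagation uses the chain rule to compute the gradient of the loss with respect to every weight and bias. **Gradient descent** then uses those gradients to update the parameters (Step 4). Let's compute the gradients step by step, starting at the output.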

#### a) Output Layer Gradients

Let’s start with the derivative of the loss with respect to the prediction:

$$
\frac{dL}{d\hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = -\frac{1}{0.65} \approx -1.538
$$

$$
\frac{d\hat{y}}{dz_2} = \hat{y}(1 - \hat{y}) = 0.65(0.35) = 0.2275
$$

$$
\frac{dL}{dz_2} = \frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{dz_2} = -1.538 \cdot 0.2275 \approx -0.35
$$

Now compute:

* $\frac{dL}{dW_2} = \frac{dL}{dz_2} \cdot a_1 = -0.35 \cdot 0.622 \approx -0.218$
* $\frac{dL}{db_2} = \frac{dL}{dz_2} = -0.35$
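The output-layer chain rule can be checked numerically. The sketch below starts from the rounded forward-pass values (`y_hat = 0.65`, `a1 = 0.622`); the variable names are illustrative.

```python
# Rounded values from the forward pass
y, y_hat, a1 = 1.0, 0.65, 0.622

# Chain rule, factor by factor
dL_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)   # derivative of BCE w.r.t. prediction
dyhat_dz2 = y_hat * (1 - y_hat)                 # sigmoid derivative at the output
dL_dz2 = dL_dyhat * dyhat_dz2                   # numerically equal to y_hat - y

# Parameter gradients for the output layer
dL_dW2 = dL_dz2 * a1
dL_db2 = dL_dz2

print(round(dL_dz2, 3), round(dL_dW2, 3), round(dL_db2, 3))
```

Note that `dL_dz2` comes out equal to `y_hat - y`, which is the shortcut used later in the general derivation.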

#### b) Hidden Layer Gradients

Use chain rule:

$$
\frac{dL}{da_1} = \frac{dL}{dz_2} \cdot W_2 = -0.35 \cdot 1 = -0.35
$$

$$
\frac{da_1}{dz_1} = a_1(1 - a_1) = 0.622(0.378) = 0.235
$$

$$
\frac{dL}{dz_1} = \frac{dL}{da_1} \cdot \frac{da_1}{dz_1} = -0.35 \cdot 0.235 ≈ -0.082
$$

Now compute:

* $\frac{dL}{dW_1} = \frac{dL}{dz_1} \cdot x^T = -0.082 \cdot \begin{bmatrix} 1 & 0 \end{bmatrix} = \begin{bmatrix} -0.082 & 0 \end{bmatrix}$
* $\frac{dL}{db_1} = -0.082$
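The same chain-rule steps for the hidden layer, as a short sketch. It starts from the rounded values derived above (`dL_dz2 = -0.35`, `a1 = 0.622`), and the names are illustrative.

```python
# Rounded values carried over from the previous steps
dL_dz2 = -0.35
W2 = 1.0
a1 = 0.622
x = [1.0, 0.0]

dL_da1 = dL_dz2 * W2            # gradient flowing back through the output weight
da1_dz1 = a1 * (1 - a1)         # sigmoid derivative at the hidden neuron
dL_dz1 = dL_da1 * da1_dz1

# One gradient entry per input feature; the second is zero because x[1] = 0
dL_dW1 = [dL_dz1 * xi for xi in x]
dL_db1 = dL_dz1

print(round(dL_dz1, 3), [round(g, 3) for g in dL_dW1])
```

The zero entry shows that a weight attached to a zero input receives no update from this example.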


### 📉 Step 4: Update Weights

Use a **learning rate** $\eta = 0.1$ and move each parameter a small step against its gradient:

* $W_2 := W_2 - \eta \cdot \frac{dL}{dW_2} = 1 - 0.1 \cdot (-0.218) \approx 1.022$
* $b_2 := 0 - 0.1 \cdot (-0.35) = 0.035$
* $W_1 := \begin{bmatrix} 0.5 & -0.5 \end{bmatrix} - 0.1 \cdot \begin{bmatrix} -0.082 & 0 \end{bmatrix} = \begin{bmatrix} 0.5082 & -0.5 \end{bmatrix}$
* $b_1 := 0 - 0.1 \cdot (-0.082) = 0.0082$
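The update rule is the same subtraction for every parameter. A minimal sketch, using the rounded gradients from Step 3 (the variable names are illustrative):

```python
eta = 0.1  # learning rate

# Current parameters
W1, b1, W2, b2 = [0.5, -0.5], 0.0, 1.0, 0.0

# Rounded gradients from the backpropagation step
dW1, db1, dW2, db2 = [-0.082, 0.0], -0.082, -0.218, -0.35

# Gradient descent: parameter := parameter - eta * gradient
W2 = W2 - eta * dW2
b2 = b2 - eta * db2
W1 = [w - eta * g for w, g in zip(W1, dW1)]
b1 = b1 - eta * db1

print(W1, b1, W2, b2)
```

All gradients are negative or zero here, so every update either increases a parameter or leaves it unchanged, which pushes the prediction toward the target of 1.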


### 🔁 Repeat

You can now do another forward pass with updated weights to see the loss go down.
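To see the loss fall over repeated passes, the four steps can be wrapped in a loop. This is a sketch under the same setup (one example, one hidden neuron, $\eta = 0.1$); the `train_step` helper is hypothetical, and it uses unrounded values, so the first loss differs slightly from the hand-computed 0.431.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(params, x, y, eta):
    """One forward pass, backward pass, and update. Returns (new_params, loss)."""
    W1, b1, W2, b2 = params
    # Forward pass
    a1 = sigmoid(sum(w * xi for w, xi in zip(W1, x)) + b1)
    y_hat = sigmoid(W2 * a1 + b2)
    loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    # Backward pass
    delta2 = y_hat - y
    delta1 = delta2 * W2 * a1 * (1 - a1)
    # Gradient descent update
    new_params = (
        [w - eta * delta1 * xi for w, xi in zip(W1, x)],
        b1 - eta * delta1,
        W2 - eta * delta2 * a1,
        b2 - eta * delta2,
    )
    return new_params, loss

params = ([0.5, -0.5], 0.0, 1.0, 0.0)
losses = []
for _ in range(5):
    params, loss = train_step(params, [1.0, 0.0], 1.0, 0.1)
    losses.append(loss)

print([round(l, 4) for l in losses])
```

Each recorded loss is measured before that iteration's update, so the first entry corresponds to the initial parameters.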


What we have learnt so far: 
* Matrix multiplication
* Activation (sigmoid)
* Loss calculation
* Backpropagation (chain rule)
* Gradient descent



## 🧠 Setup: One hidden layer neural network

We’ll assume this structure:

* **Input**: $x \in \mathbb{R}^n$
* **Hidden layer**:

  * $z_1 = W_1 \cdot x + b_1$
  * $a_1 = \sigma(z_1)$
* **Output layer**:

  * $z_2 = W_2 \cdot a_1 + b_2$
  * $\hat{y} = \sigma(z_2)$

And loss:

$$
L = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]
$$
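The general structure can be sketched for any input size $n$ and hidden size $h$. The code below is an illustration under one assumption not stated above: the output is a single neuron, so `W2` is a list of `h` weights and `b2` is a scalar. The helper names `forward` and `bce` are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """x: n inputs; W1: h rows of n weights; b1: h biases; W2: h weights; b2: scalar."""
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [sigmoid(z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2
    return z1, a1, z2, sigmoid(z2)

def bce(y, y_hat):
    """Binary cross-entropy for one example."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# The earlier worked example is the special case n = 2, h = 1
z1, a1, z2, y_hat = forward([1.0, 0.0], [[0.5, -0.5]], [0.0], [1.0], 0.0)
loss = bce(1.0, y_hat)
print(round(y_hat, 3), round(loss, 3))
```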

We'll compute **4 gradients**:

* $\frac{\partial L}{\partial W_2}, \frac{\partial L}{\partial b_2}$
* $\frac{\partial L}{\partial W_1}, \frac{\partial L}{\partial b_1}$


## ✅ Step-by-step Gradient Flow


### ① Output Layer Gradients

**Start from the loss and go backward**.

Let’s define:

* $\hat{y} = \sigma(z_2)$
* $\sigma'(z_2) = \hat{y}(1 - \hat{y})$

#### a) $\frac{\partial L}{\partial z_2}$

$$
\frac{\partial L}{\partial z_2} = (\hat{y} - y)
$$

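This compact form comes from the chain rule. For binary cross-entropy with a sigmoid output, $\frac{\partial L}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$, and multiplying by $\sigma'(z_2) = \hat{y}(1 - \hat{y})$ cancels the denominator, leaving $\hat{y} - y$.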
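One way to gain confidence in this result is a finite-difference check: nudge $z_2$ slightly and compare the measured slope of the loss with $\hat{y} - y$. The sketch below does this at $z_2 = 0.622$; the helper `loss_from_logit` and the step size `eps` are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_from_logit(z2, y):
    """Binary cross-entropy as a function of the pre-activation z2."""
    y_hat = sigmoid(z2)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

z2, y, eps = 0.622, 1.0, 1e-6

# Central difference approximation of dL/dz2
numeric = (loss_from_logit(z2 + eps, y) - loss_from_logit(z2 - eps, y)) / (2 * eps)

# Closed-form result: y_hat - y
analytic = sigmoid(z2) - y

print(numeric, analytic)
```

The two values should agree to many decimal places; a large gap would indicate a mistake in the derivation or the code.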

#### b) Gradients for W₂ and b₂:

$$
\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial z_2} \cdot a_1^T
$$

$$
\frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial z_2}
$$

🧠 So far:

* The output error term is $\delta_2 = \hat{y} - y$
* Multiplying it by $a_1^T$ gives $\frac{\partial L}{\partial W_2}$


### ② Hidden Layer Gradients

We now pass the gradient back into the hidden layer.

#### a) Compute error term for hidden layer:

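$$
\delta_1 = (W_2^T \cdot \delta_2) \odot \sigma'(z_1)
$$

* $W_2^T \cdot \delta_2$ is how the output layer’s gradient flows backward.
* $\sigma'(z_1) = a_1(1 - a_1)$
* $\odot$ is the element-wise product: each hidden unit's backward signal is scaled by that unit's own sigmoid slope.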

So $\delta_1$ is the error signal flowing into the hidden layer, with one entry per hidden unit.

#### b) Gradients for W₁ and b₁:

$$
\frac{\partial L}{\partial W_1} = \delta_1 \cdot x^T
$$

$$
\frac{\partial L}{\partial b_1} = \delta_1
$$
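With more than one hidden unit, the shapes become visible. The sketch below assumes a hypothetical 2-unit hidden layer with made-up activations and weights, chosen only to keep the arithmetic easy; `hidden_grads` is an illustrative helper.

```python
def hidden_grads(x, a1, W2, delta2):
    """Gradients for the hidden layer of a network with one output neuron."""
    # (W2^T * delta2), scaled element-wise by sigma'(z1) = a1 * (1 - a1)
    delta1 = [w * delta2 * a * (1 - a) for w, a in zip(W2, a1)]
    # Outer product delta1 x^T: one row per hidden unit, one column per input
    dW1 = [[d * xi for xi in x] for d in delta1]
    db1 = list(delta1)
    return delta1, dW1, db1

# Hypothetical values, for illustration only
delta1, dW1, db1 = hidden_grads(x=[1.0, 0.0], a1=[0.5, 0.5], W2=[1.0, -2.0], delta2=-0.4)

print(delta1)
print(dW1)
```

`dW1` has the same shape as $W_1$ (hidden units by inputs), which is what allows the update $W_1 := W_1 - \eta \, \partial L / \partial W_1$.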


## 🎯 Summary of Backpropagation Steps

From forward pass:

* $z_1 = W_1 x + b_1$
* $a_1 = \sigma(z_1)$
* $z_2 = W_2 a_1 + b_2$
* $\hat{y} = \sigma(z_2)$

Now backward pass:

```text
Step 1: δ₂ = (ŷ - y)

Step 2:
∂L/∂W₂ = δ₂ × a₁ᵀ
∂L/∂b₂ = δ₂

Step 3: δ₁ = (W₂ᵀ × δ₂) ⊙ σ'(z₁)     (⊙ = element-wise product)

Step 4:
∂L/∂W₁ = δ₁ × xᵀ
∂L/∂b₁ = δ₁
```

These four steps are the complete core of backpropagation for a network with one hidden layer. Deeper networks repeat Steps 3 and 4 once for each additional layer.
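The four steps translate almost line for line into code. The sketch below applies them to the earlier worked example (two inputs, one hidden neuron) and compares one gradient against a finite-difference estimate. The `backward` and `loss_fn` helpers are illustrative, not a reference implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(x, y, W1, b1, W2, b2):
    """Forward pass followed by binary cross-entropy."""
    a1 = sigmoid(sum(w * xi for w, xi in zip(W1, x)) + b1)
    y_hat = sigmoid(W2 * a1 + b2)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def backward(x, y, W1, b1, W2, b2):
    # Forward pass (a1 and y_hat are needed by the backward steps)
    a1 = sigmoid(sum(w * xi for w, xi in zip(W1, x)) + b1)
    y_hat = sigmoid(W2 * a1 + b2)
    delta2 = y_hat - y                             # Step 1
    dW2, db2 = delta2 * a1, delta2                 # Step 2
    delta1 = W2 * delta2 * a1 * (1 - a1)           # Step 3
    dW1, db1 = [delta1 * xi for xi in x], delta1   # Step 4
    return dW1, db1, dW2, db2

x, y = [1.0, 0.0], 1.0
W1, b1, W2, b2 = [0.5, -0.5], 0.0, 1.0, 0.0
dW1, db1, dW2, db2 = backward(x, y, W1, b1, W2, b2)

# Finite-difference check on b1
eps = 1e-6
num_db1 = (loss_fn(x, y, W1, b1 + eps, W2, b2)
           - loss_fn(x, y, W1, b1 - eps, W2, b2)) / (2 * eps)

print(db1, num_db1)
```

The analytic and numeric values of the `b1` gradient should match closely. The same check can be repeated for any other parameter.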
