# Exercise 01: Train a Tiny 2-Layer Network Manually

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shang-vikas/series1-coding-exercises/blob/main/exercises/blog-031/exercise-01.ipynb)

## Setup

In [22]:
# Install required packages using the kernel's Python interpreter
import sys
import subprocess
import importlib

def install_if_missing(package, import_name=None):
    """Install package if it's not already installed."""
    if import_name is None:
        import_name = package

    try:
        importlib.import_module(import_name)
        print(f"✓ {package} is already installed")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print(f"✓ {package} installed successfully")

# Install required packages
install_if_missing("numpy")

✓ numpy is already installed


**Goal:**

- Implement forward pass
- Compute loss
- Manually compute gradients
- Update weights
- Observe the loss decrease

### Step 0 - Setup

In [23]:
import numpy as np

np.random.seed(42)

### Step 1 - Tiny Dataset (Realistic-ish Binary Classification)

We simulate something simple but meaningful:

- If x1 + x2 > 3 → class 1
- Else → class 0
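
The labeling rule above can be checked directly against the four points used in this exercise. This is a small sketch, not part of the original notebook; the name `labels` is introduced here only for illustration.

```python
import numpy as np

# Same four points as the exercise's dataset
X = np.array([
    [2.0, 3.0],
    [1.0, 1.0],
    [3.0, 2.0],
    [0.5, 0.5],
])

# Apply the rule: class 1 when x1 + x2 > 3, otherwise class 0
labels = (X.sum(axis=1) > 3).astype(int).reshape(-1, 1)

print(labels.ravel())  # [1 0 1 0]
```

The result has shape (4, 1), matching the `y` array defined in the next cell.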

In [24]:
X = np.array([
    [2.0, 3.0],
    [1.0, 1.0],
    [3.0, 2.0],
    [0.5, 0.5]
])

y = np.array([[1], [0], [1], [0]])

# Batch size = 4

### Step 2 - Initialize Weights

In [25]:
W1 = np.random.randn(2, 2) * 0.1
b1 = np.zeros((1, 2))

W2 = np.random.randn(2, 1) * 0.1
b2 = np.zeros((1, 1))

lr = 0.1

### Step 3 - Define Functions

In [30]:
def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_loss(p, y):
    # Binary cross-entropy (log loss), averaged over the batch.
    # eps keeps np.log away from log(0) when p is exactly 0 or 1.
    eps = 1e-8
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
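
Before training, it can help to sanity-check the helpers on inputs whose answers are known. The sketch below repeats the functions so it runs on its own; `y_check`, `p_half`, `loss_half`, and `mask` are names assumed here for illustration. It shows that a model predicting 0.5 everywhere has a loss of ln 2, roughly 0.6931, which is a useful reference value when reading training logs.

```python
import numpy as np

def relu_derivative(z):
    return (z > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_loss(p, y):
    eps = 1e-8
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

y_check = np.array([[1], [0], [1], [0]])

# sigmoid(0) is exactly 0.5: the "undecided" prediction
p_half = sigmoid(np.zeros((4, 1)))

# An undecided model scores ln(2) on any binary labels
loss_half = compute_loss(p_half, y_check)
print(round(float(loss_half), 4), round(float(np.log(2)), 4))  # 0.6931 0.6931

# ReLU passes gradient only where its input was strictly positive
mask = relu_derivative(np.array([-1.0, 0.0, 2.0]))
print(mask)  # [0. 0. 1.]
```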

### Step 4 - One Training Step (Manual Backprop)

In [31]:
# Forward pass
z1 = X @ W1 + b1 # @ is matrix multiplication in numpy
a1 = relu(z1)

z2 = a1 @ W2 + b2
p = sigmoid(z2)

loss = compute_loss(p, y)
print("Initial Loss:", loss)

# Backward pass

# Output layer gradient
dz2 = p - y                      # shape (4,1)

dW2 = a1.T @ dz2 / len(X)
db2 = np.mean(dz2, axis=0, keepdims=True)

# Backprop into layer 1
da1 = dz2 @ W2.T
dz1 = da1 * relu_derivative(z1)

dW1 = X.T @ dz1 / len(X)
db1 = np.mean(dz1, axis=0, keepdims=True)

# Update weights
W2 -= lr * dW2
b2 -= lr * db2
W1 -= lr * dW1
b1 -= lr * db1

Initial Loss: 0.1380176200223486
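
One way to gain confidence in hand-derived gradients is a finite-difference check: nudge each parameter slightly, measure how the loss changes, and compare that slope with the backprop result. The sketch below is not part of the original exercise. The fixed weight values, the `params` dictionary, and the helper names `loss_fn`, `manual_grads`, and `numerical_grads` are assumptions chosen for illustration; the gradient formulas are the same ones used in the cell above.

```python
import numpy as np

X = np.array([[2.0, 3.0], [1.0, 1.0], [3.0, 2.0], [0.5, 0.5]])
y = np.array([[1], [0], [1], [0]])

# Hypothetical fixed weights, chosen so no pre-activation sits at the ReLU kink
params = {
    "W1": np.array([[0.1, -0.2], [0.05, 0.15]]),
    "b1": np.zeros((1, 2)),
    "W2": np.array([[0.3], [-0.2]]),
    "b2": np.zeros((1, 1)),
}

def loss_fn(params):
    """Forward pass followed by binary cross-entropy."""
    z1 = X @ params["W1"] + params["b1"]
    a1 = np.maximum(0, z1)
    p = 1 / (1 + np.exp(-(a1 @ params["W2"] + params["b2"])))
    eps = 1e-8
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def manual_grads(params):
    """Backprop gradients, using the same formulas as the training step."""
    n = len(X)
    z1 = X @ params["W1"] + params["b1"]
    a1 = np.maximum(0, z1)
    p = 1 / (1 + np.exp(-(a1 @ params["W2"] + params["b2"])))
    dz2 = p - y
    dz1 = (dz2 @ params["W2"].T) * (z1 > 0)
    return {
        "W1": X.T @ dz1 / n,
        "b1": np.mean(dz1, axis=0, keepdims=True),
        "W2": a1.T @ dz2 / n,
        "b2": np.mean(dz2, axis=0, keepdims=True),
    }

def numerical_grads(params, h=1e-5):
    """Central differences: (L(w + h) - L(w - h)) / (2h) for every entry."""
    grads = {}
    for name, value in params.items():
        grad = np.zeros_like(value)
        for idx in np.ndindex(value.shape):
            original = value[idx]
            value[idx] = original + h
            loss_plus = loss_fn(params)
            value[idx] = original - h
            loss_minus = loss_fn(params)
            value[idx] = original  # restore the parameter
            grad[idx] = (loss_plus - loss_minus) / (2 * h)
        grads[name] = grad
    return grads

analytic = manual_grads(params)
numeric = numerical_grads(params)
max_diff = max(np.max(np.abs(analytic[k] - numeric[k])) for k in params)
print("Largest gradient mismatch:", max_diff)
```

A mismatch far below the size of the gradients themselves suggests the manual derivation is right. A large mismatch usually points to a missing transpose, a forgotten `/ len(X)`, or a ReLU mask applied to the wrong array.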


### Step 5 - Run Multiple Steps

In [32]:
# Re-initialize weights for full training
W1 = np.random.randn(2, 2) * 0.1
b1 = np.zeros((1, 2))

W2 = np.random.randn(2, 1) * 0.1
b2 = np.zeros((1, 1))

for epoch in range(200):
    # Forward
    z1 = X @ W1 + b1
    a1 = relu(z1)

    z2 = a1 @ W2 + b2
    p = sigmoid(z2)

    loss = compute_loss(p, y)

    # Backward
    dz2 = p - y
    dW2 = a1.T @ dz2 / len(X)
    db2 = np.mean(dz2, axis=0, keepdims=True)

    da1 = dz2 @ W2.T
    dz1 = da1 * relu_derivative(z1)

    dW1 = X.T @ dz1 / len(X)
    db1 = np.mean(dz1, axis=0, keepdims=True)

    # Update
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1

    if epoch % 20 == 0:
        print(f"Epoch {epoch} | Loss: {loss:.4f}")

Epoch 0 | Loss: 0.6931
Epoch 20 | Loss: 0.6931
Epoch 40 | Loss: 0.6931
Epoch 60 | Loss: 0.6931
Epoch 80 | Loss: 0.6931
Epoch 100 | Loss: 0.6931
Epoch 120 | Loss: 0.6931
Epoch 140 | Loss: 0.6931
Epoch 160 | Loss: 0.6931
Epoch 180 | Loss: 0.6931


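You should see the loss decrease. If it stays flat at 0.6931 instead, as in the run recorded above, the predictions are all sitting close to 0.5 (ln 2 ≈ 0.6931). That usually means the hidden ReLU units are inactive for every input or the gradients are too small to move the weights. Re-run the initialization and train again, or try a larger initialization scale.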

**No magic.**

Just:

- input × gradient
- weight × gradient
- repeat
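
Written out for this network, the three bullets above correspond to the equations below. Here $N = 4$ is the batch size and $\odot$ is the element-wise product; $\delta_2$ and $\delta_1$ are shorthand introduced here for the code's `dz2` and `dz1`. The first line follows from the binary cross-entropy loss $\ell = -[y \log p + (1 - y)\log(1 - p)]$ with $p = \sigma(z_2)$ and $\sigma'(z_2) = p(1 - p)$.

$$
\begin{aligned}
\delta_2 &= -\left(\frac{y}{p} - \frac{1 - y}{1 - p}\right) p(1 - p) = p - y \\
\frac{\partial L}{\partial W_2} &= \frac{1}{N}\, a_1^{\top} \delta_2
  && \text{(input} \times \text{gradient)} \\
\delta_1 &= \left(\delta_2 W_2^{\top}\right) \odot \mathbb{1}[z_1 > 0]
  && \text{(weight} \times \text{gradient)} \\
\frac{\partial L}{\partial W_1} &= \frac{1}{N}\, X^{\top} \delta_1
  && \text{(input} \times \text{gradient)}
\end{aligned}
$$

The bias gradients `db2` and `db1` are the row-wise means of $\delta_2$ and $\delta_1$.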

## 🔎 What You Should Observe

- Loss decreases steadily.
- Removing ReLU changes behavior.
- Increasing learning rate can cause divergence.
- Zeroing W2 kills gradient flow to layer 1.
- Changing initialization changes training stability.
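
The claim about zeroing `W2` can be seen in a single backward pass. The sketch below is an illustration, not part of the original notebook; the `W1` values are hypothetical and were picked so that some hidden units are active. Because `da1 = dz2 @ W2.T`, a zero `W2` makes every entry of `da1`, and therefore `dW1`, zero on that step.

```python
import numpy as np

X = np.array([[2.0, 3.0], [1.0, 1.0], [3.0, 2.0], [0.5, 0.5]])
y = np.array([[1], [0], [1], [0]])

W1 = np.array([[0.1, -0.2], [0.05, 0.15]])  # hypothetical values
b1 = np.zeros((1, 2))
W2 = np.zeros((2, 1))                       # output weights zeroed on purpose
b2 = np.zeros((1, 1))

# Forward pass
z1 = X @ W1 + b1
a1 = np.maximum(0, z1)
p = 1 / (1 + np.exp(-(a1 @ W2 + b2)))

# Backward pass, same formulas as the training loop
dz2 = p - y
dW2 = a1.T @ dz2 / len(X)
da1 = dz2 @ W2.T          # every entry is zero because W2 is zero
dz1 = da1 * (z1 > 0)
dW1 = X.T @ dz1 / len(X)

print("dW1 all zero:", not dW1.any())  # True
print("dW2 all zero:", not dW2.any())  # False
```

Note that `dW2` itself is not zero here, so `W2` moves away from zero after the first update and layer 1 starts receiving gradient again on later steps.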

**We encourage you to:**

- Print intermediate gradients.
- Set learning rate too high.
- Set W2 very small.
- Replace ReLU with sigmoid.
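
To make these experiments quick to repeat, the training loop can be wrapped in a function. The following is a minimal sketch under stated assumptions: the function name `train`, its parameters `lr`, `w2_scale`, `epochs`, and `seed`, and the choice to track the norm of `dW1` are all introduced here for illustration and are not part of the original notebook. It uses the same network and gradient formulas as Step 5.

```python
import numpy as np

X = np.array([[2.0, 3.0], [1.0, 1.0], [3.0, 2.0], [0.5, 0.5]])
y = np.array([[1], [0], [1], [0]])

def train(lr=0.1, w2_scale=0.1, epochs=200, seed=0):
    """Hypothetical helper: train the 2-layer net, return losses and |dW1| per epoch."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(2, 2)) * 0.1
    b1 = np.zeros((1, 2))
    W2 = rng.normal(size=(2, 1)) * w2_scale
    b2 = np.zeros((1, 1))
    eps = 1e-8

    losses, grad_norms = [], []
    for _ in range(epochs):
        # Forward
        z1 = X @ W1 + b1
        a1 = np.maximum(0, z1)
        p = 1 / (1 + np.exp(-(a1 @ W2 + b2)))
        loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        losses.append(float(loss))

        # Backward
        dz2 = p - y
        dW2 = a1.T @ dz2 / len(X)
        db2 = np.mean(dz2, axis=0, keepdims=True)
        dz1 = (dz2 @ W2.T) * (z1 > 0)
        dW1 = X.T @ dz1 / len(X)
        db1 = np.mean(dz1, axis=0, keepdims=True)
        grad_norms.append(float(np.linalg.norm(dW1)))

        # Update
        W2 -= lr * dW2
        b2 -= lr * db2
        W1 -= lr * dW1
        b1 -= lr * db1
    return losses, grad_norms

# Compare a moderate and a larger learning rate
for lr in (0.1, 1.0):
    losses, grad_norms = train(lr=lr)
    print(f"lr={lr}: first loss {losses[0]:.4f}, "
          f"last loss {losses[-1]:.4f}, last |dW1| {grad_norms[-1]:.2e}")
```

Passing a very small `w2_scale` (for example `1e-6`) shows how a tiny `W2` shrinks the gradient reaching layer 1. Swapping `np.maximum(0, z1)` and the `(z1 > 0)` mask for a sigmoid and its derivative is left as the last experiment in the list.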

**Let it break.**


