# Exercise 03: Go Beyond Full-Batch Training (Challenge Mode)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shang-vikas/series1-coding-exercises/blob/main/exercises/blog-031/exercise-03.ipynb)

## Setup

In [1]:
# Install required packages using the kernel's Python interpreter
import sys
import subprocess
import importlib

def install_if_missing(package, import_name=None):
    """Install package if it's not already installed."""
    if import_name is None:
        import_name = package

    try:
        importlib.import_module(import_name)
        print(f"✓ {package} is already installed")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print(f"✓ {package} installed successfully")

# Install required packages
install_if_missing("numpy")
install_if_missing("scikit-learn", "sklearn")

✓ numpy is already installed
✓ scikit-learn is already installed


You already trained using full-batch gradient descent.

Now implement three real-world improvements manually:

1. Mini-batch SGD
2. L2 Regularization
3. Overfitting Demonstration

**No frameworks.**
**Still NumPy.**

### Initial Setup: Load Data and Define Functions

In [6]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(42)

# Load real dataset
data = load_breast_cancer()
X = data.data
y = data.target.reshape(-1, 1)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Normalize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define functions
def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_loss(p, y):
    eps = 1e-8
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def accuracy(p, y):
    preds = (p > 0.5).astype(int)
    return np.mean(preds == y)

## 🔥 Part A: Implement Mini-Batch SGD

Right now, you use:

```
dW = X.T @ dz / N
```

That's full-batch.

Now split training data into mini-batches.
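
One way to picture the split, as a minimal sketch: the hypothetical helper `iterate_minibatches` below (not part of the exercise code) shuffles the row indices and yields consecutive slices. The data is random filler, and the row count of 455 is an assumption chosen to match an 80/20 split of the 569-row breast cancer dataset.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs that cover every row exactly once."""
    indices = rng.permutation(len(X))              # fresh random order per call
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]  # last slice may be shorter
        yield X[batch_idx], y[batch_idx]

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(455, 30))                # placeholder features
y_demo = rng.integers(0, 2, size=(455, 1))         # placeholder labels

batch_sizes = [len(xb) for xb, _ in iterate_minibatches(X_demo, y_demo, 32, rng)]
print(len(batch_sizes), batch_sizes[-1])
```

With 455 rows and a batch size of 32, this gives 15 weight updates per epoch (14 full batches plus one of 7 rows), compared with a single update per epoch in full-batch training.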

### Step 1: Create Mini-Batch Loop

In [7]:
# Initialize network
input_dim = X_train.shape[1]
hidden_dim = 16

W1 = np.random.randn(input_dim, hidden_dim) * 0.01
b1 = np.zeros((1, hidden_dim))

W2 = np.random.randn(hidden_dim, 1) * 0.01
b2 = np.zeros((1, 1))

lr = 0.05
batch_size = 32
epochs = 100

for epoch in range(epochs):

    # Shuffle data each epoch
    indices = np.random.permutation(len(X_train))
    X_train_shuffled = X_train[indices]
    y_train_shuffled = y_train[indices]

    for start in range(0, len(X_train_shuffled), batch_size):
        end = start + batch_size
        X_batch = X_train_shuffled[start:end]
        y_batch = y_train_shuffled[start:end]

        # Forward
        z1 = X_batch @ W1 + b1
        a1 = relu(z1)

        z2 = a1 @ W2 + b2
        p = sigmoid(z2)

        # Backward
        dz2 = p - y_batch
        dW2 = a1.T @ dz2 / len(X_batch)
        db2 = np.mean(dz2, axis=0, keepdims=True)

        da1 = dz2 @ W2.T
        dz1 = da1 * relu_derivative(z1)

        dW1 = X_batch.T @ dz1 / len(X_batch)
        db1 = np.mean(dz1, axis=0, keepdims=True)

        # Update
        W2 -= lr * dW2
        b2 -= lr * db2
        W1 -= lr * dW1
        b1 -= lr * db1

    if epoch % 10 == 0:
        # Evaluate on full training set
        z1_eval = X_train @ W1 + b1
        a1_eval = relu(z1_eval)
        z2_eval = a1_eval @ W2 + b2
        p_eval = sigmoid(z2_eval)
        loss = compute_loss(p_eval, y_train)
        train_acc = accuracy(p_eval, y_train)
        print(f"Epoch {epoch} | Loss: {loss:.4f} | Train Acc: {train_acc:.4f}")

Epoch 0 | Loss: 0.6817 | Train Acc: 0.6286
Epoch 10 | Loss: 0.1640 | Train Acc: 0.9626
Epoch 20 | Loss: 0.0863 | Train Acc: 0.9824
Epoch 30 | Loss: 0.0694 | Train Acc: 0.9846
Epoch 40 | Loss: 0.0616 | Train Acc: 0.9824
Epoch 50 | Loss: 0.0570 | Train Acc: 0.9824
Epoch 60 | Loss: 0.0537 | Train Acc: 0.9824
Epoch 70 | Loss: 0.0510 | Train Acc: 0.9824
Epoch 80 | Loss: 0.0491 | Train Acc: 0.9824
Epoch 90 | Loss: 0.0470 | Train Acc: 0.9824
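
The forward pass and the metric code are repeated in every evaluation block. One possible refactor, sketched here with the hypothetical helpers `forward` and `evaluate` (these names are not in the exercise), keeps the same math in one place. The demo inputs are made up.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Two-layer forward pass: ReLU hidden layer, sigmoid output."""
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)
    z2 = a1 @ W2 + b2
    return 1 / (1 + np.exp(-z2))

def evaluate(X, y, W1, b1, W2, b2, eps=1e-8):
    """Return (binary cross-entropy loss, accuracy) on the given data."""
    p = forward(X, W1, b1, W2, b2)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    acc = np.mean((p > 0.5).astype(int) == y)
    return loss, acc

# Tiny made-up check: all-zero weights give p = 0.5 for every row.
X_demo = np.ones((4, 3))
y_demo = np.array([[0], [1], [0], [1]])
W1_demo, b1_demo = np.zeros((3, 2)), np.zeros((1, 2))
W2_demo, b2_demo = np.zeros((2, 1)), np.zeros((1, 1))

loss, acc = evaluate(X_demo, y_demo, W1_demo, b1_demo, W2_demo, b2_demo)
print(f"Loss: {loss:.4f} | Acc: {acc:.2f}")
```

With zero weights the loss is ln 2 (about 0.6931), which is why the epoch-0 losses in this exercise start near 0.69 with small random weights.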


## 🔥 Part B: Add L2 Regularization Manually

Overfitting often goes hand in hand with large weights.

We discourage that by adding a penalty on the weights:

**Loss = original_loss + (λ / 2) * ||W||²**

The gradient of each weight matrix then picks up one extra term:

**dW += λ * W**
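
The term `λ * W` is the exact gradient of a penalty written as (λ / 2) * ||W||². Without the factor of 1/2 the gradient would be 2λW, which only rescales λ. A minimal finite-difference check, using made-up weights and a hypothetical helper `l2_penalty`:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))   # made-up weights
lam = 0.01

def l2_penalty(W, lam):
    # (lam / 2) * sum of squared weights
    return 0.5 * lam * np.sum(W ** 2)

analytic = lam * W            # the term added to dW in the backward step

# Central differences, one weight at a time.
numeric = np.zeros_like(W)
h = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus = W.copy()
        W_plus[i, j] += h
        W_minus = W.copy()
        W_minus[i, j] -= h
        numeric[i, j] = (l2_penalty(W_plus, lam) - l2_penalty(W_minus, lam)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-8))
```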

### Modify Backward Step

In [8]:
# Re-initialize network
W1 = np.random.randn(input_dim, hidden_dim) * 0.01
b1 = np.zeros((1, hidden_dim))

W2 = np.random.randn(hidden_dim, 1) * 0.01
b2 = np.zeros((1, 1))

lr = 0.05
lambda_reg = 0.01
batch_size = 32
epochs = 100

for epoch in range(epochs):

    # Shuffle data each epoch
    indices = np.random.permutation(len(X_train))
    X_train_shuffled = X_train[indices]
    y_train_shuffled = y_train[indices]

    for start in range(0, len(X_train_shuffled), batch_size):
        end = start + batch_size
        X_batch = X_train_shuffled[start:end]
        y_batch = y_train_shuffled[start:end]

        # Forward
        z1 = X_batch @ W1 + b1
        a1 = relu(z1)

        z2 = a1 @ W2 + b2
        p = sigmoid(z2)

        # Backward
        dz2 = p - y_batch
        dW2 = a1.T @ dz2 / len(X_batch)
        db2 = np.mean(dz2, axis=0, keepdims=True)

        da1 = dz2 @ W2.T
        dz1 = da1 * relu_derivative(z1)

        dW1 = X_batch.T @ dz1 / len(X_batch)
        db1 = np.mean(dz1, axis=0, keepdims=True)

        # Add L2 regularization
        dW2 += lambda_reg * W2
        dW1 += lambda_reg * W1

        # Update
        W2 -= lr * dW2
        b2 -= lr * db2
        W1 -= lr * dW1
        b1 -= lr * db1

    if epoch % 10 == 0:
        # Evaluate on full training set
        z1_eval = X_train @ W1 + b1
        a1_eval = relu(z1_eval)
        z2_eval = a1_eval @ W2 + b2
        p_eval = sigmoid(z2_eval)
        loss = compute_loss(p_eval, y_train)
        train_acc = accuracy(p_eval, y_train)
        print(f"Epoch {epoch} | Loss: {loss:.4f} | Train Acc: {train_acc:.4f}")

Epoch 0 | Loss: 0.6814 | Train Acc: 0.6286
Epoch 10 | Loss: 0.1798 | Train Acc: 0.9560
Epoch 20 | Loss: 0.0956 | Train Acc: 0.9824
Epoch 30 | Loss: 0.0766 | Train Acc: 0.9824
Epoch 40 | Loss: 0.0691 | Train Acc: 0.9824
Epoch 50 | Loss: 0.0650 | Train Acc: 0.9824
Epoch 60 | Loss: 0.0623 | Train Acc: 0.9824
Epoch 70 | Loss: 0.0600 | Train Acc: 0.9846
Epoch 80 | Loss: 0.0587 | Train Acc: 0.9846
Epoch 90 | Loss: 0.0575 | Train Acc: 0.9846


**What This Does**

- Penalizes large weights
- Shrinks effective model complexity
- Tends to improve generalization
- Can stabilize training

Now you can see that regularization isn't magic.

It's just adding gradient pressure toward zero.
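
To see that pressure in isolation, here is a small sketch that assumes the data gradient is exactly zero, so only the L2 term acts. The starting weights are made up; `lr` and `lambda_reg` match the values used in the cell above.

```python
import numpy as np

lr = 0.05
lambda_reg = 0.01
W = np.array([[2.0, -1.0], [0.5, 4.0]])   # made-up starting weights
W0 = W.copy()

steps = 1000
for _ in range(steps):
    data_grad = np.zeros_like(W)          # assumption: the data term is flat
    dW = data_grad + lambda_reg * W       # same L2 term as in the training loop
    W -= lr * dW

# Each step multiplies every weight by (1 - lr * lambda_reg).
shrink = (1 - lr * lambda_reg) ** steps
print(np.allclose(W, W0 * shrink))
print(f"fraction of original magnitude left: {shrink:.3f}")
```

Each update scales the weights by `1 - lr * lambda_reg = 0.9995`, so after 1000 such steps roughly 61% of the original magnitude remains. In real training the data gradient pushes back, and the weights settle where the two forces balance.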

## 🔥 Part C: Overfitting Demonstration

Now we make it hurt.

### Shrink Training Data Artificially

In [9]:
# Use only 50 training samples
small_X = X_train[:50]
small_y = y_train[:50]

# Increase model capacity
hidden_dim = 128

# Re-initialize network
W1 = np.random.randn(input_dim, hidden_dim) * 0.01
b1 = np.zeros((1, hidden_dim))

W2 = np.random.randn(hidden_dim, 1) * 0.01
b2 = np.zeros((1, 1))

lr = 0.05
lambda_reg = 0.0  # No regularization to encourage overfitting
batch_size = 32
epochs = 1000

for epoch in range(epochs):

    # Shuffle data each epoch
    indices = np.random.permutation(len(small_X))
    X_train_shuffled = small_X[indices]
    y_train_shuffled = small_y[indices]

    for start in range(0, len(X_train_shuffled), batch_size):
        end = start + batch_size
        X_batch = X_train_shuffled[start:end]
        y_batch = y_train_shuffled[start:end]

        # Forward
        z1 = X_batch @ W1 + b1
        a1 = relu(z1)

        z2 = a1 @ W2 + b2
        p = sigmoid(z2)

        # Backward
        dz2 = p - y_batch
        dW2 = a1.T @ dz2 / len(X_batch)
        db2 = np.mean(dz2, axis=0, keepdims=True)

        da1 = dz2 @ W2.T
        dz1 = da1 * relu_derivative(z1)

        dW1 = X_batch.T @ dz1 / len(X_batch)
        db1 = np.mean(dz1, axis=0, keepdims=True)

        # Add L2 regularization (if enabled)
        if lambda_reg > 0:
            dW2 += lambda_reg * W2
            dW1 += lambda_reg * W1

        # Update
        W2 -= lr * dW2
        b2 -= lr * db2
        W1 -= lr * dW1
        b1 -= lr * db1

    if epoch % 100 == 0:
        # Evaluate on small training set
        z1_train = small_X @ W1 + b1
        a1_train = relu(z1_train)
        z2_train = a1_train @ W2 + b2
        p_train = sigmoid(z2_train)
        train_loss = compute_loss(p_train, small_y)
        train_acc = accuracy(p_train, small_y)
        
        # Evaluate on full test set
        z1_test = X_test @ W1 + b1
        a1_test = relu(z1_test)
        z2_test = a1_test @ W2 + b2
        p_test = sigmoid(z2_test)
        test_loss = compute_loss(p_test, y_test)
        test_acc = accuracy(p_test, y_test)
        
        print(f"Epoch {epoch}")
        print(f"  Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f}")
        print(f"  Test Loss:  {test_loss:.4f} | Test Acc:  {test_acc:.4f}")
        print(f"  Gap: {train_acc - test_acc:.4f}")
        print()

Epoch 0
  Train Loss: 0.6903 | Train Acc: 0.6400
  Test Loss:  0.6908 | Test Acc:  0.6228
  Gap: 0.0172

Epoch 100
  Train Loss: 0.0555 | Train Acc: 1.0000
  Test Loss:  0.1143 | Test Acc:  0.9649
  Gap: 0.0351

Epoch 200
  Train Loss: 0.0214 | Train Acc: 1.0000
  Test Loss:  0.0893 | Test Acc:  0.9474
  Gap: 0.0526

Epoch 300
  Train Loss: 0.0125 | Train Acc: 1.0000
  Test Loss:  0.0856 | Test Acc:  0.9474
  Gap: 0.0526

Epoch 400
  Train Loss: 0.0084 | Train Acc: 1.0000
  Test Loss:  0.0856 | Test Acc:  0.9474
  Gap: 0.0526

Epoch 500
  Train Loss: 0.0061 | Train Acc: 1.0000
  Test Loss:  0.0864 | Test Acc:  0.9474
  Gap: 0.0526

Epoch 600
  Train Loss: 0.0047 | Train Acc: 1.0000
  Test Loss:  0.0876 | Test Acc:  0.9474
  Gap: 0.0526

Epoch 700
  Train Loss: 0.0038 | Train Acc: 1.0000
  Test Loss:  0.0889 | Test Acc:  0.9561
  Gap: 0.0439

Epoch 800
  Train Loss: 0.0031 | Train Acc: 1.0000
  Test Loss:  0.0901 | Test Acc:  0.9561
  Gap: 0.0439

Epoch 900
  Train Loss: 0.0026 | Train 

**What happens?**

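- Training accuracy → 100% by epoch 100, while training loss keeps shrinking
- Test accuracy → peaks at 0.9649 (epoch 100), then hovers between 0.9474 and 0.9561
- Test loss → bottoms out near 0.0856 (epochs 300-400), then creeps back up
- Loss gap widens at every checkpoint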

That's overfitting.

Real.
Visible.
Measurable.
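
To put numbers on it, the sketch below copies the train and test losses from the printed checkpoints above (epochs 0 to 800; the epoch 900 line is cut off in the output, so it is left out) and computes the gap at each one. The variable names are only illustrative.

```python
import numpy as np

# Values copied from the printed output above.
epochs_logged = np.arange(0, 900, 100)
train_loss = np.array([0.6903, 0.0555, 0.0214, 0.0125, 0.0084,
                       0.0061, 0.0047, 0.0038, 0.0031])
test_loss = np.array([0.6908, 0.1143, 0.0893, 0.0856, 0.0856,
                      0.0864, 0.0876, 0.0889, 0.0901])

loss_gap = test_loss - train_loss
best_idx = int(np.argmin(test_loss))      # first checkpoint with the lowest test loss

for e, g in zip(epochs_logged, loss_gap):
    print(f"Epoch {e:3d} | loss gap: {g:.4f}")
print("Lowest logged test loss at epoch", epochs_logged[best_idx])
```

The gap grows at every logged checkpoint, and the test loss stops improving around epoch 300 even though the training loss keeps falling.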

### 🔥 Optional: Make It Worse

Keep regularization off (`lambda_reg = 0.0`, as in the cell above).

Increase learning rate slightly.

Watch instability.
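
Why a larger learning rate can destabilize training is easiest to see on a toy problem. The sketch below is not the network from this exercise: it runs gradient descent on a one-dimensional quadratic with an arbitrarily chosen curvature of 10, where each step multiplies `w` by `1 - lr * curvature`. The function name `run_gd` and all values are assumptions for illustration.

```python
import numpy as np

def run_gd(lr, curvature=10.0, w0=1.0, steps=20):
    """Gradient descent on f(w) = 0.5 * curvature * w**2; returns |w| after each step."""
    w = w0
    history = []
    for _ in range(steps):
        grad = curvature * w
        w -= lr * grad
        history.append(abs(w))
    return np.array(history)

stable = run_gd(lr=0.05)     # per-step multiplier: 1 - 0.05 * 10 = 0.5
unstable = run_gd(lr=0.25)   # per-step multiplier: 1 - 0.25 * 10 = -1.5

print(stable[-1] < 1e-5, unstable[-1] > 1e3)
```

Once the multiplier's magnitude exceeds 1, every step overshoots further than the last. The network's loss surface is not a simple quadratic, so the threshold there has to be found by experiment.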

**You now see:**

- Capacity
- Optimization
- Generalization

interacting in real time.

## 🎯 What This Exercise Teaches

- Mini-batch introduces stochasticity.
- L2 regularization adds gradient pressure.
- Overfitting is a mismatch between model capacity and the amount of data.
- Nothing mystical: it's all gradient manipulation.
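
The first bullet can be checked numerically. The sketch below uses a plain logistic regression on made-up data rather than the two-layer network, and holds the weights fixed. It compares each mini-batch gradient with the full-batch gradient; the helper `grad` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, batch_size = 100, 5, 32
X = rng.normal(size=(N, D))                 # made-up features
y = rng.integers(0, 2, size=(N, 1))         # made-up binary labels
W = rng.normal(size=(D, 1)) * 0.1           # fixed weights, no training here

def grad(X_part, y_part, W):
    """Gradient of mean binary cross-entropy for logistic regression."""
    p = 1 / (1 + np.exp(-(X_part @ W)))
    return X_part.T @ (p - y_part) / len(X_part)

full_grad = grad(X, y, W)

batch_grads, batch_sizes = [], []
for start in range(0, N, batch_size):
    xb = X[start:start + batch_size]
    yb = y[start:start + batch_size]
    batch_grads.append(grad(xb, yb, W))
    batch_sizes.append(len(xb))

# The size-weighted average of the batch gradients recovers the full-batch gradient.
weighted = sum(n * g for n, g in zip(batch_sizes, batch_grads)) / N
print(np.allclose(weighted, full_grad))

# Individual batches deviate from it; that deviation is the noise in SGD.
deviations = [np.linalg.norm(g - full_grad) for g in batch_grads]
print(all(d > 0 for d in deviations))
```

Each mini-batch gradient is a noisy estimate of the full-batch one: right on average, off on any single step.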