# 🤖 Application 3: A Simple Neural Network for Image Classification

> *"A neural network is just a stack of linear algebra and a sprinkle of calculus."*

This is our capstone application! We will build a simple, one-hidden-layer neural network from scratch to classify handwritten digits from the famous MNIST dataset. This will bring together all the concepts from this course in a single, powerful example.

## 🎯 How the Math Comes Together

- **Linear Algebra**: The core of the network! Each layer performs a matrix multiplication (`X @ W + b`) to transform the data, where each row of `X` is one sample. We will be working with matrices of inputs and weights and vectors of biases.
- **Calculus**: We'll use the **Chain Rule** to perform **Backpropagation**. This is the algorithm for calculating the gradients of the loss function with respect to every weight and bias in the network, even those in the hidden layers. We'll also need an activation function (Tanh in this notebook) and its derivative.
- **Probability & Statistics**: The final layer will use a **Softmax function** to output a probability distribution over the 10 possible digit classes. The loss function we'll use, **Cross-Entropy Loss**, is the multi-class generalization of the Log Loss we saw in logistic regression.
- **Optimization**: We'll use Mini-Batch Gradient Descent to train the network's weights and biases. An adaptive optimizer such as Adam could replace the plain update rule, but it is not implemented in this notebook.
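
The claim that Cross-Entropy Loss generalizes Log Loss can be checked numerically. The sketch below is only an illustration: the helper names `binary_log_loss` and `cross_entropy` and the three toy samples are made up for this example and are not part of the notebook's pipeline. With two classes, both formulas should give the same number.

```python
import numpy as np

def binary_log_loss(y, p):
    """Log loss for labels y in {0, 1} and predicted P(class 1) = p."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy(y_onehot, probs):
    """Multi-class cross-entropy: sum over classes, mean over samples."""
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

# Three made-up samples with two classes
y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])            # predicted probability of class 1

# The same data written as one-hot labels and two-column probabilities
y_onehot = np.stack([1 - y, y], axis=1)
probs = np.stack([1 - p, p], axis=1)

print(binary_log_loss(y, p))
print(cross_entropy(y_onehot, probs))    # same value as the line above
```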

## 📚 Import Essential Libraries and Load Data

In [None]:
# Core libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Plotting style
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

# Load MNIST data
print("Loading MNIST dataset...")
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
print("Dataset loaded!")

# Normalize pixel values to be between 0 and 1
X = X / 255.0

# One-hot encode the labels
encoder = OneHotEncoder(sparse_output=False, categories='auto')
y_onehot = encoder.fit_transform(y.reshape(-1, 1))

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_onehot, test_size=0.1, random_state=42)

print(f"Training data shape: {X_train.shape}")
print(f"Training labels shape: {y_train.shape}")
print(f"Test data shape: {X_test.shape}")
print(f"Test labels shape: {y_test.shape}")

# Visualize some digits
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_train[i].reshape(28, 28), cmap='gray')
    ax.set_title(f"Label: {np.argmax(y_train[i])}")
    ax.axis('off')
plt.tight_layout()
plt.show()

---

# 🧠 Step 1: Building the Network Architecture

Our network will have:
1. **An Input Layer**: 784 neurons (one for each pixel in a 28x28 image).
2. **A Hidden Layer**: We'll choose 128 neurons. This layer will use the Tanh activation function.
3. **An Output Layer**: 10 neurons (one for each digit, 0-9). This layer will use the Softmax activation function to produce probabilities.

### Activation Functions (Calculus)
- **Tanh (Hyperbolic Tangent)**: Squashes values into the open interval (-1, 1). It's a zero-centered alternative to the sigmoid.
- **Softmax**: Generalizes the sigmoid function for multi-class problems. It takes a vector of scores and turns it into a probability distribution where all probabilities sum to 1.
$$ \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} $$
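
As a small worked example of the formula, the sketch below compares a direct implementation with one that subtracts the row maximum before exponentiating. The function names `softmax_naive` and `softmax_stable` and the score vectors are assumptions made for illustration. Shifting every score by the same constant leaves the probabilities unchanged, but it keeps `np.exp` from overflowing.

```python
import numpy as np

def softmax_naive(z):
    """Direct translation of the formula; overflows for large scores."""
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def softmax_stable(z):
    """Subtract the row-wise maximum first; the result is mathematically identical."""
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = small + 1000.0                   # same scores shifted by a constant

p_small = softmax_stable(small)
p_large = softmax_stable(large)
with np.errstate(over='ignore', invalid='ignore'):
    p_naive = softmax_naive(large)       # exp(1001) overflows to inf, so inf / inf = nan

print(p_small)    # approximately [[0.090 0.245 0.665]]
print(p_large)    # identical to p_small
print(p_naive)    # all nan
```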

In [None]:
def tanh(z):
    return np.tanh(z)

def tanh_derivative(z):
    return 1 - np.tanh(z)**2

def softmax(z):
    # Subtract max for numerical stability
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def initialize_parameters(input_size, hidden_size, output_size):
    """ Initialize weights and biases for the network. """
    # Scale weights by sqrt(1 / fan_in) (LeCun-style scaling, a good fit for tanh);
    # He initialization would use sqrt(2 / fan_in) instead
    W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1. / input_size)
    b1 = np.zeros((1, hidden_size))
    W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1. / hidden_size)
    b2 = np.zeros((1, output_size))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
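
To see why the weights are scaled by `sqrt(1 / input_size)`, here is a rough sketch that compares scaled and unscaled weights. It assumes hypothetical inputs with zero mean and unit variance (not the real MNIST pixels), and the names `X_demo`, `W_scaled`, and `W_unscaled` exist only in this example. With scaling, the pre-activations keep a standard deviation near 1, so few tanh units saturate.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 784, 128

# Hypothetical inputs with zero mean and unit variance (not real MNIST pixels)
X_demo = rng.standard_normal((1000, input_size))

W_scaled = rng.standard_normal((input_size, hidden_size)) * np.sqrt(1.0 / input_size)
W_unscaled = rng.standard_normal((input_size, hidden_size))

Z_scaled = X_demo @ W_scaled
Z_unscaled = X_demo @ W_unscaled

print(f"std of Z with scaling:    {Z_scaled.std():.2f}")     # close to 1
print(f"std of Z without scaling: {Z_unscaled.std():.2f}")   # close to sqrt(784) = 28

# Fraction of hidden units where tanh is saturated (|tanh(z)| > 0.99)
sat_scaled = np.mean(np.abs(np.tanh(Z_scaled)) > 0.99)
sat_unscaled = np.mean(np.abs(np.tanh(Z_unscaled)) > 0.99)
print(f"saturated units: {sat_scaled:.1%} (scaled) vs {sat_unscaled:.1%} (unscaled)")
```

Saturated units have a tanh derivative close to zero, which is why poorly scaled weights tend to slow down learning.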

---

# 🚀 Step 2: Forward and Backward Propagation

### Forward Propagation (Linear Algebra)
This is how we make a prediction. Data flows forward through the network.
1. `Z1 = X @ W1 + b1`
2. `A1 = tanh(Z1)` (Activation for hidden layer)
3. `Z2 = A1 @ W2 + b2`
4. `A2 = softmax(Z2)` (Final probabilities)
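
The four steps above can be traced on a small random batch to check that the shapes line up. This is a sketch with made-up data: the batch size of 4 and the `_demo` arrays are assumptions for illustration, while the layer sizes match the architecture described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_in, n_hidden, n_out = 4, 784, 128, 10    # a hypothetical batch of 4 images

X_demo = rng.random((m, n_in))                # rows are samples, values in [0, 1)
W1 = rng.standard_normal((n_in, n_hidden)) * np.sqrt(1.0 / n_in)
b1 = np.zeros((1, n_hidden))
W2 = rng.standard_normal((n_hidden, n_out)) * np.sqrt(1.0 / n_hidden)
b2 = np.zeros((1, n_out))

Z1 = X_demo @ W1 + b1                         # (4, 784) @ (784, 128) -> (4, 128)
A1 = np.tanh(Z1)                              # element-wise, shape unchanged
Z2 = A1 @ W2 + b2                             # (4, 128) @ (128, 10) -> (4, 10)
exp_z = np.exp(Z2 - Z2.max(axis=1, keepdims=True))
A2 = exp_z / exp_z.sum(axis=1, keepdims=True) # each row is a probability distribution

for name, arr in [("Z1", Z1), ("A1", A1), ("Z2", Z2), ("A2", A2)]:
    print(f"{name}: {arr.shape}")
print("Row sums of A2:", A2.sum(axis=1))
```

The biases have shape `(1, n)` and are broadcast across the rows of the batch.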

### Backward Propagation (Calculus - The Chain Rule)
This is how we learn. We calculate the error at the output and propagate it backward through the network to find out how much each weight and bias contributed to the error. This gives us the gradients.

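1. **Output Layer Error (`dZ2`)**: `A2 - y` (the combined gradient of Softmax and Cross-Entropy Loss with respect to `Z2`)
2. **Output Layer Gradients (`dW2`, `db2`)**: `dW2 = (1/m) * A1.T @ dZ2`, `db2 = (1/m) * sum(dZ2, axis=0)`, where `m` is the number of samples in the batch
3. **Hidden Layer Error (`dZ1`)**: `(dZ2 @ W2.T) * tanh_derivative(Z1)` (the `*` is an element-wise product)
4. **Hidden Layer Gradients (`dW1`, `db1`)**: `dW1 = (1/m) * X.T @ dZ1`, `db1 = (1/m) * sum(dZ1, axis=0)`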
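
The compact form of the output layer error follows from the chain rule. As a sketch for a single sample, write $a = \text{Softmax}(z)$ for the predicted probabilities and $y$ for the one-hot label; the symbols $a$, $z$, and $L$ are shorthand introduced here for one row of `A2`, one row of `Z2`, and that sample's loss.

$$ L = -\sum_{k=1}^{K} y_k \log a_k, \qquad \frac{\partial a_k}{\partial z_i} = a_k (\delta_{ki} - a_i) $$

$$ \frac{\partial L}{\partial z_i} = -\sum_{k=1}^{K} \frac{y_k}{a_k} \, a_k (\delta_{ki} - a_i) = -y_i + a_i \sum_{k=1}^{K} y_k = a_i - y_i $$

The last step uses the fact that a one-hot label sums to 1. Here $\delta_{ki}$ equals 1 when $k = i$ and 0 otherwise.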

In [None]:
def forward_propagation(X, params):
    Z1 = X @ params['W1'] + params['b1']
    A1 = tanh(Z1)
    Z2 = A1 @ params['W2'] + params['b2']
    A2 = softmax(Z2)
    cache = {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}
    return A2, cache

def backward_propagation(X, y, params, cache):
    m = X.shape[0]
    
    # Output layer
    dZ2 = cache['A2'] - y
    dW2 = (1/m) * cache['A1'].T @ dZ2
    db2 = (1/m) * np.sum(dZ2, axis=0, keepdims=True)
    
    # Hidden layer
    dA1 = dZ2 @ params['W2'].T
    dZ1 = dA1 * tanh_derivative(cache['Z1'])
    dW1 = (1/m) * X.T @ dZ1
    db1 = (1/m) * np.sum(dZ1, axis=0, keepdims=True)
    
    grads = {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
    return grads
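
A common way to gain confidence in hand-written backpropagation is a numerical gradient check. The sketch below is a self-contained illustration on a tiny network (5 inputs, 4 hidden units, 3 classes), not the notebook's own code: the helpers `forward`, `loss_fn`, `backward`, and `numerical_grad` are compact stand-ins written for this example. It compares the analytic gradients with central finite differences of the cross-entropy loss.

```python
import numpy as np

def forward(X, p):
    Z1 = X @ p["W1"] + p["b1"]
    A1 = np.tanh(Z1)
    Z2 = A1 @ p["W2"] + p["b2"]
    exp_z = np.exp(Z2 - Z2.max(axis=1, keepdims=True))
    return A1, exp_z / exp_z.sum(axis=1, keepdims=True)

def loss_fn(X, y, p):
    """Cross-entropy: sum over classes, mean over samples."""
    _, A2 = forward(X, p)
    return -np.mean(np.sum(y * np.log(A2), axis=1))

def backward(X, y, p):
    """Analytic gradients, following the four backpropagation steps."""
    m = X.shape[0]
    A1, A2 = forward(X, p)
    dZ2 = A2 - y
    dZ1 = (dZ2 @ p["W2"].T) * (1 - A1**2)     # tanh'(Z1) = 1 - tanh(Z1)^2
    return {"W1": X.T @ dZ1 / m, "b1": dZ1.sum(axis=0, keepdims=True) / m,
            "W2": A1.T @ dZ2 / m, "b2": dZ2.sum(axis=0, keepdims=True) / m}

def numerical_grad(X, y, p, name, eps=1e-5):
    """Central finite differences, one parameter entry at a time."""
    grad = np.zeros_like(p[name])
    for idx in np.ndindex(p[name].shape):
        old = p[name][idx]
        p[name][idx] = old + eps
        loss_plus = loss_fn(X, y, p)
        p[name][idx] = old - eps
        loss_minus = loss_fn(X, y, p)
        p[name][idx] = old                    # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, m = 5, 4, 3, 6
params = {"W1": rng.standard_normal((n_in, n_hidden)) * 0.5,
          "b1": np.zeros((1, n_hidden)),
          "W2": rng.standard_normal((n_hidden, n_out)) * 0.5,
          "b2": np.zeros((1, n_out))}
X_demo = rng.standard_normal((m, n_in))
y_demo = np.eye(n_out)[rng.integers(0, n_out, size=m)]   # random one-hot labels

analytic = backward(X_demo, y_demo, params)
max_diff = max(
    np.max(np.abs(analytic[name] - numerical_grad(X_demo, y_demo, params, name)))
    for name in params
)
print(f"Largest analytic vs. numerical difference: {max_diff:.2e}")
```

A very small difference suggests the analytic gradients match the loss being differentiated. The check is far too slow for the full 784-128-10 network, which is why it runs on a toy one.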

---

# ⚙️ Step 3: The Training Loop (Optimization)

Now we put it all together in a training loop using Mini-Batch Gradient Descent.
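
The shuffle-and-slice pattern behind mini-batching can be sketched on toy data first. The sizes and the `_demo` arrays below are made up for illustration. Note that when the number of samples is not a multiple of the batch size, the final batch is simply shorter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, batch_size = 10, 4                    # toy sizes chosen for illustration
X_demo = np.arange(n_samples).reshape(-1, 1)     # sample i holds the value i
y_demo = np.arange(n_samples) * 10               # label of sample i is 10 * i

# One permutation is applied to both arrays so samples and labels stay aligned
permutation = rng.permutation(n_samples)
X_shuffled, y_shuffled = X_demo[permutation], y_demo[permutation]

batch_sizes = []
for start in range(0, n_samples, batch_size):
    X_batch = X_shuffled[start:start + batch_size]
    y_batch = y_shuffled[start:start + batch_size]
    batch_sizes.append(len(X_batch))
    print(f"batch starting at {start}: samples {X_batch.ravel().tolist()}")

print(batch_sizes)    # [4, 4, 2]: the last batch is shorter
```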

In [None]:
def train_nn(X_train, y_train, hidden_size=128, epochs=10, batch_size=64, learning_rate=0.1):
    input_size = X_train.shape[1]
    output_size = y_train.shape[1]
    
    params = initialize_parameters(input_size, hidden_size, output_size)
    loss_history = []
    
    for epoch in range(epochs):
        # Shuffle the data
        permutation = np.random.permutation(X_train.shape[0])
        X_shuffled = X_train[permutation]
        y_shuffled = y_train[permutation]
        
        for i in range(0, X_train.shape[0], batch_size):
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]
            
            # Forward propagation
            A2, cache = forward_propagation(X_batch, params)
            
            # Backward propagation
            grads = backward_propagation(X_batch, y_batch, params, cache)
            
            # Update parameters (simple GD)
            params['W1'] -= learning_rate * grads['dW1']
            params['b1'] -= learning_rate * grads['db1']
            params['W2'] -= learning_rate * grads['dW2']
            params['b2'] -= learning_rate * grads['db2']
        
        # Calculate loss for the epoch
        full_preds, _ = forward_propagation(X_train, params)
        # Cross-entropy loss: sum over the 10 classes, then average over samples
        loss = -np.mean(np.sum(y_train * np.log(full_preds + 1e-8), axis=1))
        loss_history.append(loss)
        
        print(f"Epoch {epoch+1}/{epochs} - Loss: {loss:.4f}")
        
    return params, loss_history

# Train the neural network
trained_params, loss_history = train_nn(X_train, y_train, epochs=15, learning_rate=0.5)
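
A useful reference point when reading the printed losses: a model that assigns equal probability to all 10 digits has a cross-entropy of about ln(10) ≈ 2.30, provided the loss is summed over classes and averaged over samples. The sketch below illustrates this with made-up labels; the helper `cross_entropy` and the arrays `uniform` and `confident` are assumptions for this example only.

```python
import numpy as np

def cross_entropy(y_onehot, probs, eps=1e-8):
    """Sum over classes, mean over samples; eps guards against log(0)."""
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

n_classes = 10
labels = np.array([3, 1, 4, 1, 5])               # made-up labels
y_demo = np.eye(n_classes)[labels]               # one-hot encoding

# A model that knows nothing: probability 0.1 for every class
uniform = np.full((len(labels), n_classes), 1.0 / n_classes)

# A confident, correct model: 0.991 on the true class, 0.001 elsewhere
confident = 0.99 * y_demo + 0.01 / n_classes

loss_uniform = cross_entropy(y_demo, uniform)
loss_confident = cross_entropy(y_demo, confident)
print(f"uniform predictions:   {loss_uniform:.4f}")    # about ln(10) = 2.3026
print(f"confident predictions: {loss_confident:.4f}")
```

A training loss well below 2.30 therefore indicates the network has learned something beyond guessing.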

---

# 📊 Step 4: Evaluation and Analysis

Let's see how well our trained network performs on the unseen test data.

In [None]:
def evaluate(X_test, y_test, params):
    """ Evaluate the model's accuracy on the test set. """
    # Make predictions
    predictions, _ = forward_propagation(X_test, params)
    
    # Get the class with the highest probability
    predicted_labels = np.argmax(predictions, axis=1)
    true_labels = np.argmax(y_test, axis=1)
    
    # Calculate accuracy
    accuracy = np.mean(predicted_labels == true_labels)
    return accuracy

accuracy = evaluate(X_test, y_test, trained_params)
print(f"\nTest Accuracy: {accuracy * 100:.2f}%")

# Plot the loss curve
plt.figure(figsize=(10, 6))
plt.plot(loss_history)
plt.title('Training Loss Curve', fontsize=16, weight='bold')
plt.xlabel('Epoch')
plt.ylabel('Cross-Entropy Loss')
plt.grid(True)
plt.show()

# Visualize some predictions
predictions, _ = forward_propagation(X_test, trained_params)
predicted_labels = np.argmax(predictions, axis=1)
true_labels = np.argmax(y_test, axis=1)

fig, axes = plt.subplots(3, 5, figsize=(15, 9))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_test[i].reshape(28, 28), cmap='gray')
    pred_label = predicted_labels[i]
    true_label = true_labels[i]
    color = 'green' if pred_label == true_label else 'red'
    ax.set_title(f"Pred: {pred_label} | True: {true_label}", color=color)
    ax.axis('off')
plt.tight_layout()
plt.show()

### Analysis

After just 15 epochs of training, our simple neural network, built from scratch, can reach high accuracy on the MNIST test set. The exact figure is printed above and varies slightly from run to run, because the weight initialization and the shuffling are not seeded. The visualizations show correct predictions in green and incorrect ones in red, giving us a qualitative feel for the model's performance.
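
For a more quantitative view than the coloured titles, one option is a confusion matrix. The sketch below is an optional add-on, not part of the original notebook: the helper `confusion_matrix` and the 3-class toy labels are made up for illustration and are not real MNIST results. The same function could be applied to `true_labels` and `predicted_labels` with `n_classes=10`.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """cm[i, j] counts samples whose true class is i and predicted class is j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (true_labels, predicted_labels), 1)   # handles repeated index pairs
    return cm

# Made-up labels for a 3-class toy problem (not real MNIST results)
true_demo = np.array([0, 0, 1, 1, 2, 2, 2])
pred_demo = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(true_demo, pred_demo, n_classes=3)
print(cm)

# Diagonal entries are correct predictions; dividing by row sums gives per-class accuracy
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
overall_accuracy = cm.trace() / cm.sum()
print(per_class_accuracy)
print(overall_accuracy)
```

Off-diagonal entries show which digits get confused with which, something a single accuracy number hides.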

This notebook is the culmination of our journey, demonstrating how the abstract mathematical concepts of linear algebra (matrix multiplication), calculus (the chain rule for backpropagation), probability (softmax and cross-entropy), and optimization (gradient descent) all come together to create a powerful learning machine.