# Lecture 5 - From Linear Models to Neural Networks: Building Multi-Layer Perceptrons from Scratch

Welcome to **Lecture 5** of *Practical Introduction to Machine Learning and Deep Learning*!  
This lecture is part of the **SAIR ML/DL Roadmap & Bootcamp**.

## 🚀 NEW: PyTorch Integration & Motivation

**Why we're adding PyTorch now:**
- This is the **last lecture** building everything from scratch
- From Lecture 6 onward, we'll use **PyTorch** for all implementations
- Understanding the fundamentals helps you appreciate what PyTorch does automatically
- Let's validate our from-scratch implementation against PyTorch!

## 🌱 Why This Lecture Matters

In previous lectures, we mastered **linear models**:
- ✅ Linear Regression (continuous outputs)
- ✅ Logistic Regression (binary classification)
- ✅ Softmax Regression (multi-class classification)

But what if our data has **complex, non-linear patterns**? What if a simple straight line or plane can't separate our classes?

> *"Neural networks are just logistic regression repeated many times with non-linearities in between."*

## 📖 What You'll Learn

1. **The limitations of linear models**
2. **Biological inspiration for neural networks**
3. **Neuron: The fundamental building block**
4. **Activation functions: ReLU, Tanh, Sigmoid**
5. **Forward propagation through multiple layers**
6. **Backpropagation: The chain rule in action**
7. **Implementing Multi-Layer Perceptron (MLP) from scratch**
8. **Visualizing learning and decision boundaries**
9. **Comparing with sklearn's MLP**
10. **🔬 NEW: Validating against PyTorch implementation**
11. **Real-world applications and next steps**

---

## 🧠 The Core Idea

**Linear Models:**  
$$\hat{y} = \sigma(\mathbf{X}\mathbf{W} + b)$$

**Neural Networks:**  
$$\hat{y} = \sigma_2(\sigma_1(\mathbf{X}\mathbf{W}_1 + b_1)\,\mathbf{W}_2 + b_2)$$

Where $\sigma_1$ and $\sigma_2$ are **non-linear activation functions** that enable the network to learn complex patterns.

💡 **Key Insight:** By stacking linear transformations with non-linearities, a network with enough hidden units can approximate any continuous function on a bounded input region to arbitrary accuracy (the universal approximation theorem).

## Part 1: The Need for Non-Linearity
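
Before looking at data, here is a quick algebraic check of why the non-linearity is needed. The sketch below is only an illustration (the shapes, seed, and variable names are assumptions, not part of the lecture code): two stacked linear layers with no activation collapse into one linear layer, and inserting a ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                      # 5 samples, 2 features (illustrative)
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=(1, 3))
W2, b2 = rng.normal(size=(3, 1)), rng.normal(size=(1, 1))

# Two linear layers with NO activation in between
stacked = (X @ W1 + b1) @ W2 + b2

# The same map rewritten as ONE linear layer
W_eff = W1 @ W2
b_eff = b1 @ W2 + b2
collapsed = X @ W_eff + b_eff

print("stacked == single linear layer:", np.allclose(stacked, collapsed))

# With a ReLU in between, the single-layer rewrite no longer reproduces the output
with_relu = np.maximum(0, X @ W1 + b1) @ W2 + b2
print("ReLU network == single linear layer:", np.allclose(with_relu, collapsed))
```

The first check holds by plain matrix algebra, whatever the weights are. That is why depth alone adds nothing without an activation function between the layers.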

### Step 1️⃣ - Limitations of Linear Models

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons, make_circles
import torch
import torch.nn as nn
import torch.optim as optim

print("🔧 PyTorch version:", torch.__version__)
print("🚀 CUDA available:", torch.cuda.is_available())

# Create non-linearly separable datasets
np.random.seed(42)
torch.manual_seed(42)

# Moons dataset
X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)
y_moons = y_moons.reshape(-1, 1)

# Circles dataset  
X_circles, y_circles = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=42)
y_circles = y_circles.reshape(-1, 1)

# Convert to PyTorch tensors for later use
X_moons_tensor = torch.FloatTensor(X_moons)
y_moons_tensor = torch.FloatTensor(y_moons)
X_circles_tensor = torch.FloatTensor(X_circles)
y_circles_tensor = torch.FloatTensor(y_circles)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Moons
axes[0].scatter(X_moons[y_moons.flatten() == 0, 0], X_moons[y_moons.flatten() == 0, 1], 
                color='blue', label='Class 0', alpha=0.6)
axes[0].scatter(X_moons[y_moons.flatten() == 1, 0], X_moons[y_moons.flatten() == 1, 1], 
                color='red', label='Class 1', alpha=0.6)
axes[0].set_title('🌙 Moons Dataset (Non-Linear)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Circles
axes[1].scatter(X_circles[y_circles.flatten() == 0, 0], X_circles[y_circles.flatten() == 0, 1], 
                color='blue', label='Class 0', alpha=0.6)
axes[1].scatter(X_circles[y_circles.flatten() == 1, 0], X_circles[y_circles.flatten() == 1, 1], 
                color='red', label='Class 1', alpha=0.6)
axes[1].set_title('⭕ Circles Dataset (Non-Linear)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


### Step 2️⃣ - Try a Linear Model on Non-Linear Data

In [None]:
class LogisticRegression:
    def __init__(self, n_features, lr=0.01):
        """
        Initialize Logistic Regression model
        
        Args:
            n_features: number of input features
            lr: learning rate
        """
        self.W = np.random.randn(n_features, 1) * 0.01
        self.b = np.zeros((1, 1))
        self.lr = lr
        self.losses = []
        self.accuracies = []
    
    def sigmoid(self, z):
        """Sigmoid activation function"""
        # Clip to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    def forward(self, X):
        """
        Forward pass: compute predictions
        
        Args:
            X: input features (m, n)
        
        Returns:
            predictions: probabilities (m, 1)
        """
        z = X @ self.W + self.b
        return self.sigmoid(z)
    
    def compute_loss(self, y_pred, y_true):
        """
        Compute Binary Cross-Entropy loss
        
        Args:
            y_pred: predicted probabilities (m, 1)
            y_true: true labels (m, 1)
        
        Returns:
            loss: scalar BCE loss
        """
        # Clip predictions to prevent log(0)
        y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
        
        m = len(y_true)
        loss = -np.mean(
            y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
        )
        return loss
    
    def compute_accuracy(self, y_pred, y_true):
        """
        Compute classification accuracy
        
        Args:
            y_pred: predicted probabilities (m, 1)
            y_true: true labels (m, 1)
        
        Returns:
            accuracy: percentage of correct predictions
        """
        predictions = (y_pred >= 0.5).astype(int)
        return np.mean(predictions == y_true) * 100
    
    def backward(self, X, y_pred, y_true):
        """
        Compute gradients
        
        Args:
            X: input features (m, n)
            y_pred: predicted probabilities (m, 1)
            y_true: true labels (m, 1)
        
        Returns:
            dW: gradient w.r.t weights
            db: gradient w.r.t bias
        """
        m = len(y_true)
        dW = (1/m) * (X.T @ (y_pred - y_true))
        db = (1/m) * np.sum(y_pred - y_true)
        return dW, db
    
    def step(self, dW, db):
        """Update parameters"""
        self.W -= self.lr * dW
        self.b -= self.lr * db
    
    def fit(self, X, y, epochs=1000, verbose=True):
        """
        Train the model
        
        Args:
            X: training features (m, n)
            y: training labels (m, 1)
            epochs: number of training iterations
            verbose: whether to print progress
        """
        for i in range(epochs):
            # Forward pass
            y_pred = self.forward(X) # sigmoid(x@w + b)
            
            # Compute loss and accuracy
            loss = self.compute_loss(y_pred, y)
            acc = self.compute_accuracy(y_pred, y)
            
            self.losses.append(loss)
            self.accuracies.append(acc)
            
            # Backward pass
            dW, db = self.backward(X, y_pred, y)
            
            # Update parameters
            self.step(dW, db)
            
            if verbose and (i % 100 == 0 or i == epochs - 1):
                print(f"Epoch {i:4d} | Loss: {loss:.4f} | Accuracy: {acc:.2f}%")
    
    def predict(self, X):
        """
        Make predictions on new data
        
        Args:
            X: input features (m, n)
        
        Returns:
            predictions: binary predictions (m, 1)
        """
        y_pred = self.forward(X)
        return (y_pred >= 0.5).astype(int)
    
    def predict_proba(self, X):
        """
        Get probability predictions
        
        Args:
            X: input features (m, n)
        
        Returns:
            probabilities: predicted probabilities (m, 1)
        """
        return self.forward(X)

# Try logistic regression on moons dataset
linear_model = LogisticRegression(n_features=2, lr=0.1)
linear_model.fit(X_moons, y_moons, epochs=1000, verbose=False)

# Evaluate on the training data (compute_accuracy expects probabilities)
accuracy = linear_model.compute_accuracy(linear_model.predict_proba(X_moons), y_moons)

print("📊 Logistic Regression on Moons Dataset:")
print(f"Accuracy: {accuracy:.2f}%")

# Plot decision boundary
def plot_decision_boundary_linear(model, X, y, title):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    plt.contour(xx, yy, Z, colors='black', linewidths=1, levels=[0.5])
    
    plt.scatter(X[y.flatten() == 0, 0], X[y.flatten() == 0, 1], 
                color='blue', label='Class 0', alpha=0.6)
    plt.scatter(X[y.flatten() == 1, 0], X[y.flatten() == 1, 1], 
                color='red', label='Class 1', alpha=0.6)
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

plot_decision_boundary_linear(linear_model, X_moons, y_moons, 
                            "❌ Linear Decision Boundary on Non-Linear Data")

### Step 3️⃣ - Biological Inspiration: The Neuron

In [None]:
# Visualize biological vs artificial neuron
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Biological neuron
axes[0].text(0.5, 0.5, '🧠 Biological Neuron\n\n• Dendrites (inputs)\n• Cell body (processing)\n• Axon (output)\n• Synapses (connections)',
            ha='center', va='center', fontsize=14, bbox=dict(boxstyle="round,pad=0.3", facecolor="lightblue"))
axes[0].set_title('Biological Neuron', fontsize=16, weight='bold')
axes[0].axis('off')

# Artificial neuron
neuron_diagram = """
      x₁ ────→ w₁
      x₂ ────→ w₂       Σ (w⋅x + b)     σ(·)      ŷ
      ... ────→ ...  →  -----------  →  ────── → 
      xₙ ────→ wₙ
               b (bias)
"""
axes[1].text(0.5, 0.5, neuron_diagram, ha='center', va='center', fontsize=16, 
            fontfamily='monospace', bbox=dict(boxstyle="round,pad=0.3", facecolor="lightgreen"))
axes[1].set_title('Artificial Neuron (Perceptron)', fontsize=16, weight='bold')
axes[1].axis('off')

plt.tight_layout()
plt.show()

print("🔬 Key Analogy:")
print("Dendrites → Input features (x₁, x₂, ..., xₙ)")
print("Synapses → Weights (w₁, w₂, ..., wₙ)")
print("Cell body → Summation + Activation")
print("Axon → Output (ŷ)")
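
The analogy above maps directly onto a few lines of code. The following is a minimal sketch of one artificial neuron in plain Python; the function name `neuron` and the input, weight, and bias values are made up for illustration.

```python
import math

def neuron(x, w, b):
    """One artificial neuron: weighted sum of the inputs, then a sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # "cell body": summation
    return 1.0 / (1.0 + math.exp(-z))              # activation -> output ("axon")

x = [1.0, 2.0]       # inputs ("dendrites"); illustrative values
w = [0.5, -0.25]     # weights ("synapses"); illustrative values
b = 0.0              # bias

y_hat = neuron(x, w, b)   # z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0
print(y_hat)              # sigmoid(0) = 0.5
```

With these particular values the weighted sum is exactly zero, so the neuron outputs 0.5: it is undecided between the two classes.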

### Step 4️⃣ - Activation Functions: Introducing Non-Linearity

In [None]:
# Common activation functions
def relu(x):
    return np.maximum(0, x)

def tanh(x):
    return np.tanh(x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

# Plot activation functions
x = np.linspace(-5, 5, 1000)

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# ReLU
axes[0,0].plot(x, relu(x), 'b-', linewidth=2)
axes[0,0].set_title('ReLU: $f(x) = max(0, x)$')
axes[0,0].grid(True, alpha=0.3)
axes[0,0].axhline(y=0, color='k', linestyle='-', alpha=0.3)
axes[0,0].axvline(x=0, color='k', linestyle='-', alpha=0.3)

# Tanh
axes[0,1].plot(x, tanh(x), 'r-', linewidth=2)
axes[0,1].set_title('Tanh: $f(x) = \\frac{e^x - e^{-x}}{e^x + e^{-x}}$')
axes[0,1].grid(True, alpha=0.3)
axes[0,1].axhline(y=0, color='k', linestyle='-', alpha=0.3)
axes[0,1].axvline(x=0, color='k', linestyle='-', alpha=0.3)

# Sigmoid
axes[1,0].plot(x, sigmoid(x), 'g-', linewidth=2)
axes[1,0].set_title('Sigmoid: $f(x) = \\frac{1}{1 + e^{-x}}$')
axes[1,0].grid(True, alpha=0.3)
axes[1,0].axhline(y=0, color='k', linestyle='-', alpha=0.3)
axes[1,0].axvline(x=0, color='k', linestyle='-', alpha=0.3)

# Leaky ReLU
axes[1,1].plot(x, leaky_relu(x), 'purple', linewidth=2)
axes[1,1].set_title('Leaky ReLU: $f(x) = max(\\alpha x, x)$')
axes[1,1].grid(True, alpha=0.3)
axes[1,1].axhline(y=0, color='k', linestyle='-', alpha=0.3)
axes[1,1].axvline(x=0, color='k', linestyle='-', alpha=0.3)

plt.tight_layout()
plt.show()
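
One practical note on the sigmoid: the plain formula `1 / (1 + np.exp(-x))` evaluates `np.exp` on a large positive number when `x` is very negative, which triggers an overflow warning. A common workaround, sketched here under the assumption that inputs are NumPy arrays (the name `stable_sigmoid` is ours, not from the lecture), is to pick the algebraically equivalent form that only exponentiates non-positive numbers.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that never exponentiates a large positive number."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0: 1 / (1 + exp(-x)), where exp(-x) <= 1
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0: exp(x) / (1 + exp(x)), where exp(x) < 1
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)
```

Clipping the input, as the `LogisticRegression.sigmoid` method earlier in this lecture does, is another reasonable option; the two approaches agree to floating-point precision on ordinary inputs.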

### Step 5️⃣ - Derivatives of Activation Functions

In [None]:
# Derivatives
def relu_derivative(x):
    return (x > 0).astype(float)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def leaky_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1, alpha)

# Plot derivatives
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# ReLU derivative
axes[0,0].plot(x, relu_derivative(x), 'b-', linewidth=2)
axes[0,0].set_title('ReLU Derivative')
axes[0,0].grid(True, alpha=0.3)
axes[0,0].axhline(y=0, color='k', linestyle='-', alpha=0.3)
axes[0,0].axvline(x=0, color='k', linestyle='-', alpha=0.3)

# Tanh derivative
axes[0,1].plot(x, tanh_derivative(x), 'r-', linewidth=2)
axes[0,1].set_title('Tanh Derivative')
axes[0,1].grid(True, alpha=0.3)
axes[0,1].axhline(y=0, color='k', linestyle='-', alpha=0.3)
axes[0,1].axvline(x=0, color='k', linestyle='-', alpha=0.3)

# Sigmoid derivative
axes[1,0].plot(x, sigmoid_derivative(x), 'g-', linewidth=2)
axes[1,0].set_title('Sigmoid Derivative')
axes[1,0].grid(True, alpha=0.3)
axes[1,0].axhline(y=0, color='k', linestyle='-', alpha=0.3)
axes[1,0].axvline(x=0, color='k', linestyle='-', alpha=0.3)

# Leaky ReLU derivative
axes[1,1].plot(x, leaky_relu_derivative(x), 'purple', linewidth=2)
axes[1,1].set_title('Leaky ReLU Derivative')
axes[1,1].grid(True, alpha=0.3)
axes[1,1].axhline(y=0, color='k', linestyle='-', alpha=0.3)
axes[1,1].axvline(x=0, color='k', linestyle='-', alpha=0.3)

plt.tight_layout()
plt.show()
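
A cheap way to gain confidence in hand-written derivatives is to compare them with central finite differences. The sketch below redefines the activations locally so it runs on its own; the sample points and `eps` are arbitrary choices, and the points deliberately avoid `x = 0`, where ReLU has a kink.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# (function, analytic derivative) pairs matching the definitions in this step
pairs = {
    "relu": (lambda x: np.maximum(0, x), lambda x: (x > 0).astype(float)),
    "tanh": (np.tanh, lambda x: 1 - np.tanh(x) ** 2),
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1 - sigmoid(x))),
}

x = np.array([-2.0, -0.5, 0.5, 2.0])   # sample points away from the ReLU kink
eps = 1e-6

errors = {}
for name, (f, df) in pairs.items():
    # Central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
    numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
    errors[name] = float(np.max(np.abs(numeric - df(x))))
    print(f"{name:8s} max error: {errors[name]:.2e}")
```

If an analytic derivative were wrong (a missing square in the tanh formula, say), its error would be many orders of magnitude larger than the others.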

### Step 6️⃣ - Why ReLU is Popular in Deep Learning

In [None]:
print("🧠 Why ReLU Dominates Deep Learning:")
print("\n✅ Advantages:")
print("• Non-saturating for positive values → the gradient stays at 1 there, which mitigates vanishing gradients")
print("• Computationally efficient (a single max operation)")
print("• Sparse activation → many units output exactly 0")
print("• Loose biological analogy (a unit stays silent until its input crosses a threshold)")

print("\n⚠️ Challenges:")
print("• Dying ReLU problem (neurons can get stuck at 0)")
print("• Not zero-centered")

print("\n🔧 Solutions:")
print("• Leaky ReLU, Parametric ReLU, ELU variants")
print("• Proper weight initialization")
print("• Batch Normalization")
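
To put a number on the vanishing-gradient point, consider only the activation-derivative factors that backpropagation multiplies together, one per layer. This is a simplified sketch: it ignores the weight matrices, which also scale the gradient, and the depths chosen are arbitrary.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid at z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU at z (taken as 0 at z = 0)."""
    return 1.0 if z > 0 else 0.0

factors = {}
for depth in [1, 5, 10, 20]:
    # Best case for sigmoid: every unit sits at z = 0, where the slope peaks at 0.25
    sig = sigmoid_grad(0.0) ** depth
    # An active ReLU unit (z > 0) passes the gradient through unchanged
    rel = relu_grad(1.0) ** depth
    factors[depth] = (sig, rel)
    print(f"depth {depth:2d} | sigmoid factor: {sig:.2e} | ReLU factor: {rel:.1f}")
```

Even in the sigmoid's best case the factor shrinks by 4x per layer, while an active ReLU path keeps it at 1. The flip side is the dying ReLU problem: an inactive unit contributes a factor of exactly 0.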

## Part 2: Building Neural Networks from Scratch

### Step 7️⃣ - Single Layer → Multi-Layer Architecture

In [None]:
# Visualize network architectures
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Single layer (Logistic Regression)
single_layer = """
Input Layer         Output Neuron
    x₁ ────────────→ [Neuron] ────→ ŷ
    x₂ ────────────→ 
    ...            
    xₙ ────────────→ 
"""
axes[0].text(0.5, 0.5, single_layer, ha='center', va='center', fontsize=14, 
            fontfamily='monospace', bbox=dict(boxstyle="round,pad=0.3", facecolor="lightyellow"))
axes[0].set_title('Single Layer (Logistic Regression)', fontsize=16, weight='bold')
axes[0].axis('off')

# Multi-layer (Neural Network)
multi_layer = """
Input Layer      Hidden Layer 1    Hidden Layer 2    Output Layer
    x₁ ──────→ [N] → [N] → [N] ──→ [N] → [N] → [N] ──→ ŷ
    x₂ ──────→ [N] → [N] → [N] ──→ [N] → [N] → [N] ──→ 
    ...       [N] → [N] → [N]     [N] → [N] → [N]
    xₙ ──────→ [N] → [N] → [N] ──→ [N] → [N] → [N] ──→ 
"""
axes[1].text(0.5, 0.5, multi_layer, ha='center', va='center', fontsize=12, 
            fontfamily='monospace', bbox=dict(boxstyle="round,pad=0.3", facecolor="lightgreen"))
axes[1].set_title('Multi-Layer Neural Network', fontsize=16, weight='bold')
axes[1].axis('off')

plt.tight_layout()
plt.show()

print("🚀 Key Insight:")
print("Multiple layers allow the network to learn hierarchical features (image example):")
print("Layer 1: Simple features (edges, corners)")
print("Layer 2: Complex features (shapes, patterns)")  
print("Layer 3: Very complex features (objects, concepts)")
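
Going from one layer to several also changes how many parameters there are to learn. The helper below is a small sketch (the name `count_parameters` is ours) that counts weights and biases for fully connected layers described by a `layer_sizes` list, the same convention the MLP class in this lecture uses.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix (n_in x n_out) + one bias per output unit
    return total

print(count_parameters([2, 1]))          # logistic regression on 2 features: 2 + 1 = 3
print(count_parameters([2, 10, 5, 1]))   # 30 + 55 + 6 = 91
```

The `2-10-5-1` architecture trained later in this lecture therefore has 91 trainable parameters, against 3 for logistic regression on the same inputs.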

### Step 8️⃣ - Forward Propagation: Step-by-Step

In [None]:
# Manual forward propagation example
np.random.seed(42)

# Sample input
X_example = np.array([[1.0, 2.0]])  # 1 sample, 2 features

# Initialize weights and biases for a 2-layer network
# Architecture: 2 inputs → 3 hidden neurons → 1 output
W1 = np.random.randn(2, 3) * 0.1  # Input to hidden
b1 = np.zeros((1, 3))
W2 = np.random.randn(3, 1) * 0.1  # Hidden to output  
b2 = np.zeros((1, 1))

print("🔢 Forward Propagation Example:")
print(f"Input X: {X_example}")
print(f"W1 shape: {W1.shape}, b1 shape: {b1.shape}")
print(f"W2 shape: {W2.shape}, b2 shape: {b2.shape}")

# Step 1: Input to hidden layer
z1 = X_example @ W1 + b1
print(f"\nStep 1 - Linear transformation (z1): {z1}")

# Step 2: Apply activation (ReLU)
a1 = relu(z1)
print(f"Step 2 - Activation (a1 = ReLU(z1)): {a1}")

# Step 3: Hidden to output layer
z2 = a1 @ W2 + b2
print(f"Step 3 - Linear transformation (z2): {z2}")

# Step 4: Output activation (sigmoid for binary classification)
y_pred = sigmoid(z2)
print(f"Step 4 - Output (ŷ = sigmoid(z2)): {y_pred}")

print(f"\n🎯 Final prediction: {y_pred[0,0]:.4f} (probability of class 1)")
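
The walkthrough above pushes a single sample through the network. The same matrix expressions handle a whole batch at once, which is what the training code relies on. Here is a minimal sketch with the same `2 → 3 → 1` shape; the helper name `forward`, the seed, and the batch size are assumptions made for illustration.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Forward pass of a 2 -> 3 -> 1 network on a whole batch."""
    z1 = X @ W1 + b1            # (m, 2) @ (2, 3) -> (m, 3); b1 broadcasts over the rows
    a1 = np.maximum(0, z1)      # ReLU, element-wise
    z2 = a1 @ W2 + b2           # (m, 3) @ (3, 1) -> (m, 1)
    return 1.0 / (1.0 + np.exp(-z2))

rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(2, 3)) * 0.1, np.zeros((1, 3))
W2, b2 = rng.normal(size=(3, 1)) * 0.1, np.zeros((1, 1))

X_batch = rng.normal(size=(4, 2))                    # 4 samples instead of 1
batch_out = forward(X_batch, W1, b1, W2, b2)
single_out = forward(X_batch[:1], W1, b1, W2, b2)    # first sample on its own

print(batch_out.shape)                               # (4, 1)
print(np.allclose(batch_out[0], single_out[0]))      # row 0 of the batch matches the single-sample pass
```

Each row is processed independently, so batching changes the speed of the computation and not its result.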

### Step 9️⃣ - The Chain Rule: Mathematical Foundation of Backpropagation

In [None]:
# Visualize computational graph
computational_graph = """
Forward Pass:
X → z₁ = XW₁ + b₁ → a₁ = σ(z₁) → z₂ = a₁W₂ + b₂ → ŷ = σ(z₂) → Loss

Backward Pass (Chain Rule):
∂Loss/∂W₂ = ∂Loss/∂ŷ · ∂ŷ/∂z₂ · ∂z₂/∂W₂
∂Loss/∂W₁ = ∂Loss/∂ŷ · ∂ŷ/∂z₂ · ∂z₂/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂W₁

Each step uses local gradients!
"""

print("🧮 Computational Graph & Chain Rule:")
print(computational_graph)

# Example chain rule calculation
print("\n📐 Example: Computing ∂Loss/∂W₁ step by step:")
steps = [
    "1. ∂Loss/∂ŷ = -(y/ŷ - (1-y)/(1-ŷ)) = (ŷ - y) / (ŷ(1-ŷ))",
    "2. ∂ŷ/∂z₂ = ŷ(1-ŷ)  (sigmoid derivative)",
    "   Steps 1 and 2 multiply to ∂Loss/∂z₂ = ŷ - y",
    "3. ∂z₂/∂a₁ = W₂",
    "4. ∂a₁/∂z₁ = σ'(z₁)  (activation derivative)", 
    "5. ∂z₁/∂W₁ = X",
    "6. Multiply all: ∂Loss/∂W₁ = Xᵀ · [((ŷ - y) · W₂ᵀ) ⊙ σ'(z₁)]  (⊙ = element-wise product, averaged over the batch)"
]

for step in steps:
    print(step)
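
The chain-rule steps can be checked numerically. The sketch below builds a tiny `2 → 3 → 1` network, computes `dW1` with the chain rule, and compares it with a central finite-difference estimate. It uses tanh in the hidden layer because tanh is smooth, which keeps the comparison clean; the data, seed, and helper names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    z1 = X @ W1 + b1
    a1 = np.tanh(z1)                      # hidden activation
    y_hat = sigmoid(a1 @ W2 + b2)         # output probability
    return z1, a1, y_hat

def bce(y_hat, y):
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
y = (rng.random(size=(8, 1)) > 0.5).astype(float)
W1, b1 = rng.normal(size=(2, 3)) * 0.5, np.zeros((1, 3))
W2, b2 = rng.normal(size=(3, 1)) * 0.5, np.zeros((1, 1))

# Backward pass via the chain rule
m = len(y)
z1, a1, y_hat = forward(X, W1, b1, W2, b2)
dz2 = y_hat - y                           # steps 1 and 2 combined
da1 = dz2 @ W2.T                          # step 3: dz2/da1 = W2
dz1 = da1 * (1 - np.tanh(z1) ** 2)        # step 4: tanh'(z1)
dW1 = X.T @ dz1 / m                       # step 5: dz1/dW1 = X, averaged over the batch

# Numerical gradient: perturb one weight at a time
eps = 1e-6
dW1_num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        loss_p = bce(forward(X, Wp, b1, W2, b2)[2], y)
        loss_m = bce(forward(X, Wm, b1, W2, b2)[2], y)
        dW1_num[i, j] = (loss_p - loss_m) / (2 * eps)

max_diff = np.max(np.abs(dW1 - dW1_num))
print(f"max |dW1 - numeric| = {max_diff:.2e}")
```

A tiny difference means the analytic gradient and the loss function agree. Note that the sigmoid derivative never appears explicitly in the backward pass: it cancels against the denominator of the cross-entropy derivative, leaving `y_hat - y`.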

### Step 🔟 - Implementing Multi-Layer Perceptron (MLP)

In [None]:
class NeuralNetwork:
    def __init__(self, layer_sizes, activation='relu', learning_rate=0.01):
        """
        Initialize neural network
        
        Args:
            layer_sizes: list of layer sizes [input_size, hidden1_size, ..., output_size]
            activation: activation function ('relu', 'tanh', 'sigmoid')
            learning_rate: learning rate for gradient descent
        """
        self.layer_sizes = layer_sizes
        self.activation_name = activation
        self.lr = learning_rate
        self.losses = []
        self.accuracies = []
        
        # Initialize weights and biases
        self.weights = []
        self.biases = []
        
        for i in range(len(layer_sizes) - 1):
            # He initialization for ReLU; sqrt(1 / fan_in) scaling (a fan-in-only Xavier/LeCun-style variant) for tanh/sigmoid
            if activation == 'relu':
                scale = np.sqrt(2.0 / layer_sizes[i])
            else:
                scale = np.sqrt(1.0 / layer_sizes[i])
                
            W = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * scale
            b = np.zeros((1, layer_sizes[i+1]))
            
            self.weights.append(W)
            self.biases.append(b)
    
    def forward(self, X):
        """Forward propagation through all layers"""
        self.activations = [X]  # Store all activations for backprop
        self.z_values = []      # Store all linear outputs
        
        # Hidden layers
        for i in range(len(self.weights) - 1):
            z = self.activations[-1] @ self.weights[i] + self.biases[i]
            self.z_values.append(z)
            
            if self.activation_name == 'relu':
                a = relu(z)
            elif self.activation_name == 'tanh':
                a = tanh(z)
            elif self.activation_name == 'sigmoid':
                a = sigmoid(z)
            else:
                raise ValueError(f"Unknown activation: {self.activation_name!r}")
                
            self.activations.append(a)
        
        # Output layer (always sigmoid for binary classification)
        z_output = self.activations[-1] @ self.weights[-1] + self.biases[-1]
        self.z_values.append(z_output)
        output = sigmoid(z_output)
        self.activations.append(output)
        
        return output
    
    def compute_loss(self, y_pred, y_true):
        """Binary cross-entropy loss"""
        y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    
    def compute_accuracy(self, y_pred, y_true):
        """Classification accuracy"""
        predictions = (y_pred >= 0.5).astype(int)
        return np.mean(predictions == y_true) * 100
    
    def backward(self, X, y_true):
        """Backward propagation"""
        m = len(y_true)
        
        # Initialize gradients
        dW = [np.zeros_like(W) for W in self.weights]
        db = [np.zeros_like(b) for b in self.biases]
        
        # Output layer gradient
        dZ_output = self.activations[-1] - y_true  # ∂Loss/∂z_output (sigmoid + BCE reduce to ŷ - y)
        
        # Backpropagate through layers
        for l in range(len(self.weights) - 1, -1, -1):
            # Gradients for weights and biases
            dW[l] = (1/m) * (self.activations[l].T @ dZ_output)
            db[l] = (1/m) * np.sum(dZ_output, axis=0, keepdims=True)
            
            if l > 0:  # Continue backpropagation
                # Gradient for previous layer
                dA_prev = dZ_output @ self.weights[l].T
                
                # Gradient through activation function
                if self.activation_name == 'relu':
                    dZ_prev = dA_prev * relu_derivative(self.z_values[l-1])
                elif self.activation_name == 'tanh':
                    dZ_prev = dA_prev * tanh_derivative(self.z_values[l-1])
                elif self.activation_name == 'sigmoid':
                    dZ_prev = dA_prev * sigmoid_derivative(self.z_values[l-1])
                
                dZ_output = dZ_prev
        
        return dW, db
    
    def update_parameters(self, dW, db):
        """Update weights and biases using gradients"""
        for i in range(len(self.weights)):
            self.weights[i] -= self.lr * dW[i]
            self.biases[i] -= self.lr * db[i]
    
    def fit(self, X, y, epochs=1000, verbose=True):
        """Train the neural network"""
        for epoch in range(epochs):
            # Forward pass
            y_pred = self.forward(X)
            
            # Compute loss and accuracy
            loss = self.compute_loss(y_pred, y)
            accuracy = self.compute_accuracy(y_pred, y)
            
            self.losses.append(loss)
            self.accuracies.append(accuracy)
            
            # Backward pass
            dW, db = self.backward(X, y)
            
            # Update parameters
            self.update_parameters(dW, db)
            
            if verbose and (epoch % 100 == 0 or epoch == epochs - 1):
                print(f"Epoch {epoch:4d} | Loss: {loss:.4f} | Accuracy: {accuracy:.2f}%")
    
    def predict(self, X):
        """Make predictions"""
        y_pred = self.forward(X)
        return (y_pred >= 0.5).astype(int)
    
    def predict_proba(self, X):
        """Get probability predictions"""
        return self.forward(X)
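
The constructor above scales the initial weights by `sqrt(2 / fan_in)` for ReLU. The sketch below shows why the scale matters: it pushes random inputs through a stack of ReLU layers and tracks the mean squared activation. The width, depth, sample count, and the helper name `mean_square_after_layers` are arbitrary choices for illustration, and biases are omitted.

```python
import numpy as np

def mean_square_after_layers(scale_fn, width=256, depth=5, n_samples=200, seed=0):
    """Mean squared activation after `depth` ReLU layers with weights N(0, scale^2)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_samples, width))           # inputs with mean square close to 1
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * scale_fn(width)
        a = np.maximum(0, a @ W)                      # linear layer + ReLU
    return float(np.mean(a ** 2))

he = mean_square_after_layers(lambda fan_in: np.sqrt(2.0 / fan_in))   # He scaling
small = mean_square_after_layers(lambda fan_in: 0.01)                 # fixed small scale

print(f"He init      : {he:.3e}")
print(f"scale = 0.01 : {small:.3e}")
```

With He scaling the signal keeps roughly its original magnitude from layer to layer, while a fixed small scale makes it shrink geometrically. Gradients flowing backward are affected in a similar way.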

### Step 1️⃣1️⃣ - Train Neural Network on Moons Dataset

In [None]:
# Create and train neural network
print("🚀 Training Neural Network on Moons Dataset...")
nn_model = NeuralNetwork(layer_sizes=[2, 10, 5, 1], activation='relu', learning_rate=0.1)
nn_model.fit(X_moons, y_moons, epochs=2000, verbose=True)

print("\n✅ Training Complete!")
print(f"Final Loss: {nn_model.losses[-1]:.4f}")
print(f"Final Accuracy: {nn_model.accuracies[-1]:.2f}%")

### Step 1️⃣2️⃣ - Visualize Neural Network Training

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
axes[0].plot(nn_model.losses, color='red', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Binary Cross-Entropy Loss')
axes[0].set_title('📉 Neural Network Training Loss')
axes[0].grid(True, alpha=0.3)

# Accuracy curve
axes[1].plot(nn_model.accuracies, color='green', linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy (%)')
axes[1].set_title('📈 Neural Network Training Accuracy')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Step 1️⃣3️⃣ - Visualize Neural Network Decision Boundary

In [None]:
def plot_nn_decision_boundary(model, X, y, title):
    """Plot decision boundary for neural network"""
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    plt.contour(xx, yy, Z, colors='black', linewidths=2, levels=[0.5])
    
    plt.scatter(X[y.flatten() == 0, 0], X[y.flatten() == 0, 1], 
                color='blue', label='Class 0', alpha=0.6, s=50)
    plt.scatter(X[y.flatten() == 1, 0], X[y.flatten() == 1, 1], 
                color='red', label='Class 1', alpha=0.6, s=50)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

plot_nn_decision_boundary(nn_model, X_moons, y_moons, 
                         "🎯 Neural Network Decision Boundary (Non-Linear!)")

### Step 1️⃣4️⃣ - Compare with Linear Model

In [None]:
# Compare performances
linear_accuracy = linear_model.compute_accuracy(linear_model.forward(X_moons), y_moons)
nn_accuracy = nn_model.accuracies[-1]

# Format the percentages first so the % sign stays attached to the number
linear_acc_str = f"{linear_accuracy:.2f}%"
nn_acc_str = f"{nn_accuracy:.2f}%"

print("📊 MODEL COMPARISON ON MOONS DATASET")
print("=" * 50)
print(f"{'Model':<25} {'Accuracy':<15} {'Decision Boundary'}")
print("-" * 50)
print(f"{'Logistic Regression':<25} {linear_acc_str:<15} {'Linear ❌'}")
print(f"{'Neural Network':<25} {nn_acc_str:<15} {'Non-Linear ✅'}")
print("=" * 50)

## 🔬 NEW: PyTorch Validation & Benchmarking

### Step 1️⃣5️⃣ - PyTorch Implementation for Comparison

In [None]:
# PyTorch implementation of the same neural network
class PyTorchMLP(nn.Module):
    def __init__(self, layer_sizes, activation='relu'):
        super(PyTorchMLP, self).__init__()
        
        layers = []
        for i in range(len(layer_sizes) - 1):
            layers.append(nn.Linear(layer_sizes[i], layer_sizes[i+1]))
            if i < len(layer_sizes) - 2:  # Hidden layers
                if activation == 'relu':
                    layers.append(nn.ReLU())
                elif activation == 'tanh':
                    layers.append(nn.Tanh())
                elif activation == 'sigmoid':
                    layers.append(nn.Sigmoid())
        
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return torch.sigmoid(self.network(x))  # Sigmoid for binary classification

# Create PyTorch model with same architecture
pytorch_model = PyTorchMLP([2, 10, 5, 1], activation='relu')
criterion = nn.BCELoss()
optimizer = optim.SGD(pytorch_model.parameters(), lr=0.1)

print("🤖 PyTorch Model Architecture:")
print(pytorch_model)
print(f"\n🔧 Total parameters: {sum(p.numel() for p in pytorch_model.parameters())}")

# Train PyTorch model
print("\n🚀 Training PyTorch Model...")
pytorch_losses = []
pytorch_accuracies = []

for epoch in range(2000):
    # Forward pass
    outputs = pytorch_model(X_moons_tensor)
    loss = criterion(outputs, y_moons_tensor)
    
    # Backward pass and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Compute accuracy
    predicted = (outputs >= 0.5).float()
    accuracy = (predicted == y_moons_tensor).float().mean() * 100
    
    pytorch_losses.append(loss.item())
    pytorch_accuracies.append(accuracy.item())
    
    if epoch % 400 == 0 or epoch == 1999:
        print(f"Epoch {epoch:4d} | Loss: {loss.item():.4f} | Accuracy: {accuracy.item():.2f}%")

print("\n✅ PyTorch Training Complete!")
print(f"Final Loss: {pytorch_losses[-1]:.4f}")
print(f"Final Accuracy: {pytorch_accuracies[-1]:.2f}%")
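
The two trained models start from different random weights, so their loss curves can only be compared loosely. A stricter validation is to give both implementations the same weights and compare gradients directly. The sketch below does this for a small `2 → 3 → 1` ReLU network: manual backprop in NumPy on one side, PyTorch autograd on the other. The data, shapes, and variable names are illustrative assumptions, and the tensors are kept in float64 so the comparison is not blurred by precision.

```python
import numpy as np
import torch

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
y = (rng.random(size=(8, 1)) > 0.5).astype(float)
W1, b1 = rng.normal(size=(2, 3)) * 0.5, np.zeros((1, 3))
W2, b2 = rng.normal(size=(3, 1)) * 0.5, np.zeros((1, 1))

# Manual backprop in NumPy (ReLU hidden layer, sigmoid output, BCE loss)
m = len(y)
z1 = X @ W1 + b1
a1 = np.maximum(0, z1)
y_hat = 1.0 / (1.0 + np.exp(-(a1 @ W2 + b2)))
dz2 = y_hat - y
dW2 = a1.T @ dz2 / m
dz1 = (dz2 @ W2.T) * (z1 > 0)
dW1 = X.T @ dz1 / m

# Same computation with autograd, starting from the SAME weights
Xt, yt = torch.tensor(X), torch.tensor(y)
W1t = torch.tensor(W1, requires_grad=True)
b1t = torch.tensor(b1, requires_grad=True)
W2t = torch.tensor(W2, requires_grad=True)
b2t = torch.tensor(b2, requires_grad=True)

out = torch.sigmoid(torch.relu(Xt @ W1t + b1t) @ W2t + b2t)
loss = torch.nn.functional.binary_cross_entropy(out, yt)
loss.backward()

print("dW1 matches autograd:", np.allclose(dW1, W1t.grad.numpy()))
print("dW2 matches autograd:", np.allclose(dW2, W2t.grad.numpy()))
```

If both checks pass, the hand-derived backward pass and autograd agree on this example, which is stronger evidence than two accuracy numbers that happen to be close.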

### Step 1️⃣6️⃣ - Compare From-Scratch vs PyTorch Implementation

In [None]:
# Compare our implementation with PyTorch
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Loss comparison
axes[0, 0].plot(nn_model.losses, 'b-', label='From Scratch', alpha=0.7, linewidth=2)
axes[0, 0].plot(pytorch_losses, 'r-', label='PyTorch', alpha=0.7, linewidth=2)
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].set_title('📉 Loss Comparison: From Scratch vs PyTorch')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Accuracy comparison
axes[0, 1].plot(nn_model.accuracies, 'b-', label='From Scratch', alpha=0.7, linewidth=2)
axes[0, 1].plot(pytorch_accuracies, 'r-', label='PyTorch', alpha=0.7, linewidth=2)
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Accuracy (%)')
axes[0, 1].set_title('📈 Accuracy Comparison: From Scratch vs PyTorch')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Decision boundary - From Scratch
h = 0.02
x_min, x_max = X_moons[:, 0].min() - 0.5, X_moons[:, 0].max() + 0.5
y_min, y_max = X_moons[:, 1].min() - 0.5, X_moons[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# From Scratch prediction
Z_scratch = nn_model.predict(np.c_[xx.ravel(), yy.ravel()])
Z_scratch = Z_scratch.reshape(xx.shape)

axes[1, 0].contourf(xx, yy, Z_scratch, alpha=0.3, cmap='RdBu')
axes[1, 0].scatter(X_moons[y_moons.flatten() == 0, 0], X_moons[y_moons.flatten() == 0, 1], 
                   color='blue', label='Class 0', alpha=0.6, s=30)
axes[1, 0].scatter(X_moons[y_moons.flatten() == 1, 0], X_moons[y_moons.flatten() == 1, 1], 
                   color='red', label='Class 1', alpha=0.6, s=30)
axes[1, 0].set_title(f'From Scratch ({nn_model.accuracies[-1]:.1f}%)')
axes[1, 0].legend()

# PyTorch prediction
with torch.no_grad():
    Z_pytorch = pytorch_model(torch.FloatTensor(np.c_[xx.ravel(), yy.ravel()]))
    Z_pytorch = (Z_pytorch >= 0.5).float().numpy().reshape(xx.shape)

axes[1, 1].contourf(xx, yy, Z_pytorch, alpha=0.3, cmap='RdBu')
axes[1, 1].scatter(X_moons[y_moons.flatten() == 0, 0], X_moons[y_moons.flatten() == 0, 1], 
                   color='blue', label='Class 0', alpha=0.6, s=30)
axes[1, 1].scatter(X_moons[y_moons.flatten() == 1, 0], X_moons[y_moons.flatten() == 1, 1], 
                   color='red', label='Class 1', alpha=0.6, s=30)
axes[1, 1].set_title(f'PyTorch ({pytorch_accuracies[-1]:.1f}%)')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

# Performance comparison
print("\n" + "=" * 70)
print("📊 FROM-SCRATCH vs PYTORCH BENCHMARK RESULTS")
print("=" * 70)
print(f"\n{'Metric':<25} {'From Scratch':<15} {'PyTorch':<15} {'Difference'}")
print("-" * 70)
print(f"{'Final Loss':<25} {nn_model.losses[-1]:<15.4f} {pytorch_losses[-1]:<15.4f} {nn_model.losses[-1] - pytorch_losses[-1]:+.4f}")
print(f"{'Final Accuracy (%)':<25} {nn_model.accuracies[-1]:<15.2f} {pytorch_accuracies[-1]:<15.2f} {nn_model.accuracies[-1] - pytorch_accuracies[-1]:+.2f}")
print(f"{'Architecture':<25} {'2-10-5-1':<15} {'2-10-5-1':<15} {'Same'}")
print("=" * 70)

print("\n🎯 Key Insights:")
print("✅ Both implementations should reach similar loss and accuracy")
print("✅ Matching results indicate our from-scratch forward and backward passes are correct")
print("✅ PyTorch provides automatic differentiation and GPU support")
print("✅ Understanding fundamentals helps appreciate PyTorch's magic!")

### Step 1️⃣7️⃣: Test on Circles Dataset with PyTorch

In [None]:
# Test on circles dataset
print("\n🧪 Testing on Circles Dataset...")

# Linear model (should fail)
linear_circles = LogisticRegression(n_features=2, lr=0.1)
linear_circles.fit(X_circles, y_circles, epochs=1000, verbose=False)
linear_circles_acc = linear_circles.compute_accuracy(linear_circles.forward(X_circles), y_circles)

# Neural network (should succeed)
nn_circles = NeuralNetwork(layer_sizes=[2, 20, 10, 1], activation='tanh', learning_rate=0.1)
nn_circles.fit(X_circles, y_circles, epochs=2000, verbose=False)
nn_circles_acc = nn_circles.accuracies[-1]

# PyTorch model for circles
pytorch_circles = PyTorchMLP([2, 20, 10, 1], activation='tanh')
optimizer_circles = optim.SGD(pytorch_circles.parameters(), lr=0.1)

pytorch_circles_losses = []
pytorch_circles_accuracies = []

for epoch in range(2000):
    outputs = pytorch_circles(X_circles_tensor)
    loss = criterion(outputs, y_circles_tensor)
    
    optimizer_circles.zero_grad()
    loss.backward()
    optimizer_circles.step()
    
    predicted = (outputs >= 0.5).float()
    accuracy = (predicted == y_circles_tensor).float().mean() * 100
    
    pytorch_circles_losses.append(loss.item())
    pytorch_circles_accuracies.append(accuracy.item())

pytorch_circles_acc = pytorch_circles_accuracies[-1]

print("📊 Circles Dataset Results:")
print(f"Linear Model Accuracy: {linear_circles_acc:.2f}%")
print(f"Neural Network (From Scratch) Accuracy: {nn_circles_acc:.2f}%")
print(f"Neural Network (PyTorch) Accuracy: {pytorch_circles_acc:.2f}%")

print("\n" + "=" * 60)
print("🎯 CIRCLES DATASET: LINEAR vs NEURAL NETWORKS")
print("=" * 60)
print(f"{'Model':<30} {'Accuracy':<15} {'Can Learn Circles?'}")
print("-" * 60)
# Format the percentage first so '%' sits next to the number, then pad the column
print(f"{'Linear Model':<30} {f'{linear_circles_acc:.2f}%':<15} {'❌ No'}")
print(f"{'Neural Network (From Scratch)':<30} {f'{nn_circles_acc:.2f}%':<15} {'✅ Yes'}")
print(f"{'Neural Network (PyTorch)':<30} {f'{pytorch_circles_acc:.2f}%':<15} {'✅ Yes'}")
print("=" * 60)

## Part 3: Advanced Concepts and Applications

### Step 1️⃣8️⃣: Multi-Class Neural Network

In [None]:
class MultiClassNeuralNetwork:
    def __init__(self, layer_sizes, activation='relu', learning_rate=0.01):
        """Neural network for multi-class classification"""
        self.layer_sizes = layer_sizes
        self.activation_name = activation
        self.lr = learning_rate
        self.losses = []
        self.accuracies = []
        
        # Initialize weights and biases
        self.weights = []
        self.biases = []
        
        for i in range(len(layer_sizes) - 1):
            if activation == 'relu':
                scale = np.sqrt(2.0 / layer_sizes[i])
            else:
                scale = np.sqrt(1.0 / layer_sizes[i])
                
            W = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * scale
            b = np.zeros((1, layer_sizes[i+1]))
            
            self.weights.append(W)
            self.biases.append(b)
    
    def softmax(self, z):
        """Softmax activation for output layer"""
        z_stable = z - np.max(z, axis=1, keepdims=True)
        exp_z = np.exp(z_stable)
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)
    
    def forward(self, X):
        """Forward propagation"""
        self.activations = [X]
        self.z_values = []
        
        # Hidden layers
        for i in range(len(self.weights) - 1):
            z = self.activations[-1] @ self.weights[i] + self.biases[i]
            self.z_values.append(z)
            
            if self.activation_name == 'relu':
                a = relu(z)
            elif self.activation_name == 'tanh':
                a = tanh(z)
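            elif self.activation_name == 'sigmoid':
                a = sigmoid(z)
            else:
                # Fail early instead of hitting an undefined `a` below
                raise ValueError(f"Unsupported activation: {self.activation_name}")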
                
            self.activations.append(a)
        
        # Output layer (softmax for multi-class)
        z_output = self.activations[-1] @ self.weights[-1] + self.biases[-1]
        self.z_values.append(z_output)
        output = self.softmax(z_output)
        self.activations.append(output)
        
        return output
    
    def compute_loss(self, y_pred, y_true):
        """Categorical cross-entropy loss"""
        if y_true.ndim == 1:
            y_true = np.eye(y_pred.shape[1])[y_true]
        
        y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
        return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
    
    def compute_accuracy(self, y_pred, y_true):
        """Classification accuracy"""
        pred_classes = np.argmax(y_pred, axis=1)
        if y_true.ndim == 2:
            true_classes = np.argmax(y_true, axis=1)
        else:
            true_classes = y_true
        return np.mean(pred_classes == true_classes) * 100
    
    def backward(self, X, y_true):
        """Backward propagation for multi-class"""
        m = len(y_true)
        
        if y_true.ndim == 1:
            y_true = np.eye(self.weights[-1].shape[1])[y_true]
        
        dW = [np.zeros_like(W) for W in self.weights]
        db = [np.zeros_like(b) for b in self.biases]
        
        # Output layer gradient (softmax + cross-entropy has nice derivative)
        dZ_output = self.activations[-1] - y_true
        
        for l in range(len(self.weights) - 1, -1, -1):
            dW[l] = (1/m) * (self.activations[l].T @ dZ_output)
            db[l] = (1/m) * np.sum(dZ_output, axis=0, keepdims=True)
            
            if l > 0:
                dA_prev = dZ_output @ self.weights[l].T
                
                if self.activation_name == 'relu':
                    dZ_prev = dA_prev * relu_derivative(self.z_values[l-1])
                elif self.activation_name == 'tanh':
                    dZ_prev = dA_prev * tanh_derivative(self.z_values[l-1])
                elif self.activation_name == 'sigmoid':
                    dZ_prev = dA_prev * sigmoid_derivative(self.z_values[l-1])
                
                dZ_output = dZ_prev
        
        return dW, db
    
    def update_parameters(self, dW, db):
        """Update weights and biases"""
        for i in range(len(self.weights)):
            self.weights[i] -= self.lr * dW[i]
            self.biases[i] -= self.lr * db[i]
    
    def fit(self, X, y, epochs=1000, verbose=True):
        """Train the network"""
        for epoch in range(epochs):
            y_pred = self.forward(X)
            loss = self.compute_loss(y_pred, y)
            accuracy = self.compute_accuracy(y_pred, y)
            
            self.losses.append(loss)
            self.accuracies.append(accuracy)
            
            dW, db = self.backward(X, y)
            self.update_parameters(dW, db)
            
            if verbose and (epoch % 100 == 0 or epoch == epochs - 1):
                print(f"Epoch {epoch:4d} | Loss: {loss:.4f} | Accuracy: {accuracy:.2f}%")
    
    def predict(self, X):
        """Predict class labels"""
        y_pred = self.forward(X)
        return np.argmax(y_pred, axis=1)
    
    def predict_proba(self, X):
        """Predict class probabilities"""
        return self.forward(X)
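
The comment in `backward` says that softmax combined with cross-entropy has a simple derivative: the gradient of the loss with respect to the output logits is `softmax(z) - y`. A finite-difference check is one way to convince yourself of this. The sketch below is a standalone illustration, not part of the class above. The helper names, the random data, and the step size `eps` are assumptions made for the demo. Because the loss is a mean over samples, the `1/m` factor is folded into the analytic gradient here, whereas the class applies it when forming `dW` and `db`.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z_stable = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(z_stable)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def cross_entropy(z, y_onehot):
    """Mean categorical cross-entropy computed from raw logits z."""
    p = softmax(z)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 3))                # 5 samples, 3 classes (arbitrary demo sizes)
y = np.eye(3)[rng.integers(0, 3, size=5)]  # random one-hot labels

# Analytic gradient of the mean loss w.r.t. the logits
analytic = (softmax(z) - y) / z.shape[0]

# Numerical gradient via central differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        z_plus, z_minus = z.copy(), z.copy()
        z_plus[i, j] += eps
        z_minus[i, j] -= eps
        numeric[i, j] = (cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")
```

If the two gradients disagree by much more than the finite-difference error, the usual suspects are a missing `1/m` factor or a sign error in the loss.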

### Step 1️⃣9️⃣: Test Multi-Class Neural Network on Iris Dataset

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

print("🌸 Iris Dataset:")
print(f"Features: {iris.feature_names}")
print(f"Classes: {iris.target_names}")
print(f"Samples: {len(X_iris)}, Features: {X_iris.shape[1]}, Classes: {len(np.unique(y_iris))}")

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multi-class neural network
print("\n🚀 Training Multi-Class Neural Network on Iris Dataset...")
multi_nn = MultiClassNeuralNetwork(
    layer_sizes=[4, 10, 8, 3],  # 4 inputs, 2 hidden layers, 3 outputs
    activation='relu', 
    learning_rate=0.01
)
multi_nn.fit(X_train_scaled, y_train, epochs=2000, verbose=True)

# Evaluate
y_pred = multi_nn.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_pred) * 100

print(f"\n✅ Test Accuracy: {test_accuracy:.2f}%")

### Step 2️⃣0️⃣: PyTorch Multi-Class Implementation

In [None]:
# PyTorch multi-class implementation
class PyTorchMultiClassMLP(nn.Module):
    def __init__(self, input_size, hidden_sizes, num_classes):
        super(PyTorchMultiClassMLP, self).__init__()
        
        layers = []
        prev_size = input_size
        
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            prev_size = hidden_size
        
        layers.append(nn.Linear(prev_size, num_classes))
        
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x)

# Prepare PyTorch data
X_iris_tensor = torch.FloatTensor(X_train_scaled)
y_iris_tensor = torch.LongTensor(y_train)  # Use LongTensor for classification
X_test_tensor = torch.FloatTensor(X_test_scaled)

# Create and train PyTorch model
pytorch_multi = PyTorchMultiClassMLP(input_size=4, hidden_sizes=[10, 8], num_classes=3)
criterion_multi = nn.CrossEntropyLoss()
optimizer_multi = optim.SGD(pytorch_multi.parameters(), lr=0.01)

print("🤖 Training PyTorch Multi-Class Model...")
pytorch_multi_losses = []
pytorch_multi_accuracies = []

for epoch in range(2000):
    outputs = pytorch_multi(X_iris_tensor)
    loss = criterion_multi(outputs, y_iris_tensor)
    
    optimizer_multi.zero_grad()
    loss.backward()
    optimizer_multi.step()
    
    predicted = outputs.argmax(dim=1)  # class with the highest logit
    accuracy = (predicted == y_iris_tensor).float().mean() * 100
    
    pytorch_multi_losses.append(loss.item())
    pytorch_multi_accuracies.append(accuracy.item())
    
    if epoch % 400 == 0 or epoch == 1999:
        print(f"Epoch {epoch:4d} | Loss: {loss.item():.4f} | Accuracy: {accuracy.item():.2f}%")

# Evaluate PyTorch model
with torch.no_grad():
    test_outputs = pytorch_multi(X_test_tensor)
    test_predicted = test_outputs.argmax(dim=1)
    # .item() turns the 0-dim tensor into a Python float for printing and plotting
    pytorch_test_accuracy = (test_predicted == torch.LongTensor(y_test)).float().mean().item() * 100

print(f"\n✅ PyTorch Test Accuracy: {pytorch_test_accuracy:.2f}%")

# Compare multi-class results
print("\n" + "=" * 70)
print("📊 MULTI-CLASS NEURAL NETWORK COMPARISON")
print("=" * 70)
print(f"\n{'Model':<30} {'Test Accuracy':<15}")
print("-" * 70)
print(f"{'From Scratch':<30} {test_accuracy:.2f}%")
print(f"{'PyTorch':<30} {pytorch_test_accuracy:.2f}%")
print("=" * 70)

# Plot learning curves comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
axes[0].plot(multi_nn.losses, 'b-', label='From Scratch', alpha=0.7, linewidth=2)
axes[0].plot(pytorch_multi_losses, 'r-', label='PyTorch', alpha=0.7, linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Categorical Cross-Entropy Loss')
axes[0].set_title('📉 Multi-Class Training Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy curve
axes[1].plot(multi_nn.accuracies, 'b-', label='From Scratch', alpha=0.7, linewidth=2)
axes[1].plot(pytorch_multi_accuracies, 'r-', label='PyTorch', alpha=0.7, linewidth=2)
axes[1].axhline(y=test_accuracy, color='blue', linestyle='--', 
               label=f'From Scratch Test ({test_accuracy:.1f}%)', alpha=0.7)
axes[1].axhline(y=pytorch_test_accuracy, color='red', linestyle='--', 
               label=f'PyTorch Test ({pytorch_test_accuracy:.1f}%)', alpha=0.7)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy (%)')
axes[1].set_title('📈 Multi-Class Training Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
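
`PyTorchMultiClassMLP` ends in a plain `nn.Linear` with no softmax layer. That is intentional: `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally before taking the negative log-likelihood. The NumPy sketch below mirrors that computation and compares it with the softmax-then-log form used in `MultiClassNeuralNetwork.compute_loss` (minus the clipping). The function names and the example logits are hypothetical, chosen only for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy_from_logits(logits, labels):
    """Mean negative log-likelihood of the true class, starting from raw logits."""
    log_probs = log_softmax(logits)
    return -log_probs[np.arange(len(labels)), labels].mean()

def cross_entropy_from_probs(probs, labels):
    """Same loss, starting from softmax probabilities and one-hot labels."""
    onehot = np.eye(probs.shape[1])[labels]
    return -np.mean(np.sum(onehot * np.log(probs), axis=1))

# Made-up logits for two samples and three classes
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])

probs = np.exp(log_softmax(logits))  # equivalent to softmax(logits)
loss_from_logits = cross_entropy_from_logits(logits, labels)
loss_from_probs = cross_entropy_from_probs(probs, labels)

print(f"from logits: {loss_from_logits:.6f}")
print(f"from probs:  {loss_from_probs:.6f}")
```

Both routes give the same value. A practical consequence: adding an explicit softmax layer before `nn.CrossEntropyLoss` would apply softmax twice and distort the loss.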

## Part 4: Real-World Applications and Next Steps

### Step 2️⃣1️⃣: When to Use Neural Networks

In [None]:
print("🎯 WHEN TO USE NEURAL NETWORKS:")
print("\n✅ Good use cases:")
print("• Complex non-linear patterns")
print("• Large datasets (thousands+ samples)")
print("• Image, audio, text data")
print("• Hierarchical feature learning needed")
print("• State-of-the-art performance required")

print("\n⚠️ When to consider simpler models:")
print("• Small datasets (fewer than a few thousand samples)")
print("• Tabular data with clear linear relationships")
print("• Need for model interpretability")
print("• Limited computational resources")
print("• Quick prototyping")

print("\n📊 Model Selection Guide (try simpler models first):")
print("Linear/Logistic Regression → Random Forest → Neural Networks")

### Step 2️⃣2️⃣: Common Architectures and Hyperparameters

In [None]:
# Common neural network architectures
architectures = {
    "Shallow Network": [10, 1],
    "Medium Network": [64, 32, 1], 
    "Deep Network": [128, 64, 32, 16, 1],
    "Wide & Shallow": [256, 1],
    "Narrow & Deep": [16, 16, 16, 16, 16, 1]
}

print("🏗️ COMMON NEURAL NETWORK ARCHITECTURES:")
for name, layers in architectures.items():
    print(f"• {name}: {layers}")

print("\n🎛️ KEY HYPERPARAMETERS:")
hyperparams = {
    "Learning Rate": "0.001-0.1 (most important!)",
    "Hidden Layers": "1-5 for MLPs",
    "Layer Sizes": "32-512 neurons per layer",
    "Activation": "ReLU (hidden), Softmax/Sigmoid (output)",
    "Batch Size": "32-256",
    "Epochs": "Until validation loss stops improving"
}

for param, recommendation in hyperparams.items():
    print(f"• {param}: {recommendation}")
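
To compare these architectures more concretely, it can help to count their trainable parameters. The sketch below assumes each list holds the hidden and output layer sizes of a fully connected network and that the input has 2 features (as in the moons and circles datasets). `count_parameters` and `INPUT_SIZE` are hypothetical names introduced for this illustration.

```python
def count_parameters(input_size, layer_sizes):
    """Count weights and biases of a fully connected network.

    Each layer contributes fan_in * fan_out weights plus fan_out biases.
    """
    total = 0
    fan_in = input_size
    for fan_out in layer_sizes:
        total += fan_in * fan_out + fan_out
        fan_in = fan_out
    return total

# Same layer lists as in the cell above, repeated so this sketch runs on its own
example_architectures = {
    "Shallow Network": [10, 1],
    "Medium Network": [64, 32, 1],
    "Deep Network": [128, 64, 32, 16, 1],
    "Wide & Shallow": [256, 1],
    "Narrow & Deep": [16, 16, 16, 16, 16, 1],
}

INPUT_SIZE = 2  # assumption: two input features

param_counts = {
    name: count_parameters(INPUT_SIZE, layers)
    for name, layers in example_architectures.items()
}

for name, count in param_counts.items():
    print(f"{name:<16} {count:>6} parameters")
```

Under these assumptions the "Deep Network" has roughly ten times as many parameters as "Wide & Shallow", which is one reason depth and width are not interchangeable knobs when data is limited.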

### Step 2️⃣3️⃣: From Scratch to Production Frameworks

In [None]:
print("🚀 FROM SCRATCH TO PRODUCTION FRAMEWORKS:")
print("\n🧪 What we built from scratch:")
print("• Forward propagation")
print("• Backward propagation (backprop)")
print("• Activation functions and derivatives")
print("• Weight initialization")
print("• Gradient descent optimization")

print("\n🏭 What PyTorch adds:")
print("• Automatic differentiation (autograd) - No more manual chain rule!")
print("• GPU acceleration - often large speedups on big models and batches")
print("• Pre-built layers and architectures - CNN, RNN, Transformers")
print("• Advanced optimizers (Adam, RMSprop) - often converge faster than plain SGD")
print("• Regularization techniques - Dropout, BatchNorm")
print("• Distributed training - Scale to multiple GPUs")
print("• Model deployment tools - Export to production")

print("\n🔜 Next Steps (Using PyTorch):")
print("1. Lecture 6: Convolutional Neural Networks (CNNs) for images")
print("2. Lecture 7: Recurrent Neural Networks (RNNs) for sequences")
print("3. Lecture 8: Transformers for text and beyond!")
print("4. Lecture 9: Computer Vision applications")
print("5. Lecture 10: Natural Language Processing")

print("\n🎯 Why PyTorch for the rest of the course:")
print("✅ Widely used in both research and production")
print("✅ Pythonic and intuitive API")
print("✅ Excellent documentation and community")
print("✅ Seamless GPU acceleration")
print("✅ Used by companies like Meta, Tesla, OpenAI")

## 🎯 Summary: Complete Deep Learning Foundation

In [None]:
# Create a summary visualization
import pandas as pd

summary_data = {
    "Model": ["Linear Regression", "Logistic Regression", "Softmax Regression", "Neural Network", "PyTorch NN"],
    "Task": ["Regression", "Binary Classification", "Multi-Class Classification", "Complex Non-Linear", "Complex Non-Linear"],
    # Autograd is a training mechanism, not an activation, so it is listed under Implementation
    "Activation": ["None", "Sigmoid", "Softmax", "ReLU/Tanh/Sigmoid", "ReLU/Tanh/Sigmoid"],
    "Implementation": ["From Scratch", "From Scratch", "From Scratch", "From Scratch", "PyTorch (autograd)"],
    "Performance": ["Baseline", "Baseline", "Baseline", "Good", "Comparable, production-ready"]
}

summary_df = pd.DataFrame(summary_data)
print("📚 MODEL EVOLUTION SUMMARY:")
print("=" * 100)
print(summary_df.to_string(index=False))
print("=" * 100)

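### 🧠 Key Mathematical Insights:

1. **Universal Approximation Theorem**: A neural network with one hidden layer and a non-polynomial activation can approximate any continuous function on a compact domain to arbitrary accuracy, given enough neurons. The theorem guarantees that such a network exists; it does not say gradient descent will find it.
2. **Backpropagation**: Efficient gradient computation that applies the chain rule layer by layer, reusing intermediate values from the forward pass
3. **Non-Linearity**: Activation functions enable learning complex patterns; without them, stacked layers collapse into a single linear map
4. **Hierarchical Features**: Each layer learns features at different abstraction levels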
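
The role of non-linearity can be checked numerically. The sketch below uses random weights and made-up shapes (assumptions for the demo, not the lecture's trained model) to show that two linear layers with no activation in between are exactly equivalent to one linear layer, using the same `X @ W + b` convention as the from-scratch classes.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(4, 2))                      # 4 samples, 2 features
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=(1, 3))
W2, b2 = rng.normal(size=(3, 1)), rng.normal(size=(1, 1))

# Two stacked layers with identity (linear) activation
two_layer_linear = (X @ W1 + b1) @ W2 + b2

# The same mapping expressed as a single linear layer
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
single_layer = X @ W_combined + b_combined

collapses = np.allclose(two_layer_linear, single_layer)
print("Linear stack equals one linear layer:", collapses)

# With ReLU in between, the combined weights generally no longer reproduce the output
with_relu = np.maximum(0, X @ W1 + b1) @ W2 + b2
print("Max gap once ReLU is inserted:", np.max(np.abs(with_relu - single_layer)))
```

The algebra behind the first check is `(X W1 + b1) W2 + b2 = X (W1 W2) + (b1 W2 + b2)`, which holds for any number of stacked linear layers.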

### 🚀 Path Forward:

This completes our foundation in **neural networks**:
- ✅ Understanding biological inspiration
- ✅ Implementing forward and backward propagation
- ✅ Working with different activation functions
- ✅ Building both binary and multi-class networks
- ✅ Visualizing decision boundaries and learning curves
- ✅ **NEW: Validating against PyTorch implementation**

**Next up:** Convolutional Neural Networks (CNNs) for image data - using PyTorch!

---

*"Understanding the fundamentals from scratch makes you appreciate the magic of PyTorch!"*

## 🧠 Practice & Reflection: Neural Networks from Scratch

Congratulations 🎉 You have successfully built neural networks from scratch, understood the mathematics of backpropagation, and seen how they can learn complex non-linear patterns!

---

### 📌 **Part 1: Core Concepts**

1. **Explain in your own words:**
   - Why do we need activation functions in neural networks?
   - What is the "vanishing gradient" problem and how does ReLU help?
   - How does backpropagation use the chain rule?
   - What's the difference between a single neuron and a neural network?
   - Why can neural networks learn non-linear patterns while linear models can't?

2. **Activation functions:**
   - When would you use ReLU vs Tanh vs Sigmoid?
   - Compute ReLU(2), ReLU(-2), Tanh(1), Sigmoid(0)
   - Why do we use softmax for multi-class output and sigmoid for binary?
   - What happens if we use linear activation in hidden layers?

---

### 📌 **Part 2: Mathematical Understanding**

3. **Forward propagation:**
   - Given input x = [1, 2], weight matrix W = [[0.5, -0.5], [0.3, 0.7]] (rows index inputs, matching our `X @ W + b` convention), and bias b = [0.1, -0.1], compute the output ReLU(xW + b)
   - Show the matrix dimensions at each layer for a 3-layer network

4. **Backpropagation:**
   - Derive the gradient for a simple 2-layer network step by step
   - Why is the gradient for softmax + cross-entropy so elegant?
   - How does weight initialization affect training?

---

### 📌 **Part 3: Implementation Challenges**

5. **Architecture experiments:**
   - Try different architectures on moons dataset: [2,5,1], [2,10,5,1], [2,20,10,5,1]
   - Compare training time, final accuracy, and decision boundaries
   - Which works best and why?

6. **Activation function comparison:**
   - Train the same architecture with ReLU, Tanh, and Sigmoid
   - Plot loss curves for each
   - Discuss convergence speed and final performance

7. **Learning rate experiments:**
   - Try learning rates: 0.001, 0.01, 0.1, 1.0
   - Observe training stability and convergence
   - Identify signs of too high/too low learning rates
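
As a warm-up for the learning rate experiment, the effect of the step size is easy to see on a toy problem. The sketch below runs gradient descent on `f(w) = w**2`, a stand-in chosen for this illustration (it is not the moons network), with the same four learning rates. `gradient_descent_on_quadratic` is a hypothetical helper name.

```python
def gradient_descent_on_quadratic(lr, w0=1.0, steps=50):
    """Minimize f(w) = w**2 (gradient 2*w) and return every visited value of w."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - lr * 2.0 * w  # standard update: w <- w - lr * gradient
        history.append(w)
    return history

final_w = {
    lr: gradient_descent_on_quadratic(lr)[-1]
    for lr in (0.001, 0.01, 0.1, 1.0)
}

for lr, w in final_w.items():
    print(f"lr={lr:<6} w after 50 steps = {w:+.6f}")
```

Each update multiplies `w` by `(1 - 2 * lr)`. Small rates barely move, `lr=0.1` shrinks `w` quickly, and `lr=1.0` flips the sign every step without ever getting closer to the minimum. Anything larger than 1.0 would diverge on this function. On a real network the thresholds differ, but the same three regimes (too slow, converging, oscillating or diverging) show up in the loss curve.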

---

### 📌 **Part 4: Advanced Applications**

8. **Regularization:**
   - Add L2 regularization to your neural network
   - Experiment with different regularization strengths
   - Observe effects on training vs test performance

9. **Weight visualization:**
   - Extract and visualize weights from different layers
   - Compare weights before and after training
   - What patterns do you notice?

10. **Custom dataset:**
    - Create your own non-linear classification dataset
    - Train both linear and neural network models
    - Compare performance and decision boundaries
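
A possible starting point for the regularization exercise: L2 regularization adds a penalty on the squared weights to the loss and a matching term to each weight gradient. The sketch below assumes the common `lam / (2 * m)` scaling and leaves biases unregularized. `l2_penalty`, `add_l2_to_gradients`, and `lam` are hypothetical names, not part of the lecture's classes.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """Penalty added to the data loss: (lam / 2m) * sum of squared weights."""
    return (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

def add_l2_to_gradients(dW, weights, lam, m):
    """Add the penalty's gradient, (lam / m) * W, to each weight gradient."""
    return [g + (lam / m) * W for g, W in zip(dW, weights)]

# Tiny demo with random weights for a 2-4-1 network (illustrative values only)
rng = np.random.default_rng(1)
weights = [rng.normal(size=(2, 4)), rng.normal(size=(4, 1))]
data_grads = [np.zeros_like(W) for W in weights]  # pretend the data gradient is zero
lam, m = 0.1, 100

penalty = l2_penalty(weights, lam, m)
reg_grads = add_l2_to_gradients(data_grads, weights, lam, m)

print(f"L2 penalty: {penalty:.6f}")
print("Gradient shapes:", [g.shape for g in reg_grads])
```

In a class like `MultiClassNeuralNetwork`, one option is to add the penalty inside `compute_loss` and apply `add_l2_to_gradients` to `dW` at the end of `backward`, leaving `db` untouched.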

---

## ✅ **Goal of These Exercises**

By completing these challenges, you will:

- ✅ Master the mathematics of neural networks and backpropagation
- ✅ Understand how architecture choices affect model performance
- ✅ Build intuition for hyperparameter tuning
- ✅ Gain confidence in implementing complex ML algorithms from scratch
- ✅ Learn to diagnose and fix common training problems
- ✅ **NEW: Appreciate what PyTorch does automatically**
- ✅ Prepare for more advanced architectures (CNNs, RNNs, Transformers)

---

## 🎓 Next Steps

In **Lecture 6**, we'll dive into **Convolutional Neural Networks (CNNs)** for image data - using PyTorch!

---

## 🙏 Conclusion

You've now built **four fundamental ML algorithms from scratch**:
1. ✅ Linear Regression (Lecture 1)
2. ✅ Binary Logistic Regression (Lecture 4)
3. ✅ Multi-Class Softmax Regression (Lecture 4)
4. ✅ Neural Networks (This lecture)

You understand:
- How gradient descent optimizes complex models
- The role of activation functions and non-linearity
- How backpropagation efficiently computes gradients
- How to implement, visualize, and benchmark neural networks
- The mathematical foundations of deep learning
- **NEW: How PyTorch automates these processes**

**This is the core of modern deep learning!**

All advanced architectures (CNNs, RNNs, Transformers) build upon these fundamental concepts.

Keep practicing, keep building, and get ready for PyTorch! 🚀

In [None]:
# Final inspirational message
print("\n" + "✨" * 70)
print("🎉 CONGRATULATIONS! You've built neural networks from scratch!")
print("💪 You now understand the fundamentals of deep learning")
print("🔬 You validated your implementation against PyTorch")
print("🚀 Ready to tackle CNNs, RNNs, and beyond WITH PYTORCH!")
print("✨" * 70)

print("\n📚 What's Next:")
print("• Lecture 6: Convolutional Neural Networks (PyTorch)")
print("• Lecture 7: Recurrent Neural Networks (PyTorch)")
print("• Lecture 8: Transformers and Attention (PyTorch)")
print("• Lecture 9: Computer Vision Applications")
print("• Lecture 10: Natural Language Processing")

print("\n🌟 Remember:")
print("Understanding from scratch + PyTorch power = Deep Learning Mastery!")
