# ML Practice Questions Part 8: Neural Networks Fundamentals

This notebook covers the mathematical foundations and practical implementation of neural networks, including forward propagation, backpropagation, and optimization techniques. Each question builds understanding from basic perceptrons to multi-layer networks.

**Topics Covered:**
- Perceptron and multi-layer perceptron architecture
- Forward propagation and activation functions
- Backpropagation algorithm and gradient computation
- Loss functions and optimization techniques
- Regularization and generalization in neural networks

**Format:** Each question includes theory, implementation, and analysis sections.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification, make_regression, make_moons, make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.metrics import accuracy_score, mean_squared_error, log_loss
import seaborn as sns
from matplotlib.colors import ListedColormap
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8')
np.random.seed(42)

## Question 1: Perceptron and Forward Propagation

**Question:** Implement a multi-layer perceptron from scratch with different activation functions. Analyze how activation functions affect gradient flow and network expressiveness.

### Theory

**Multi-Layer Perceptron Architecture:**
- **Input layer**: $\mathbf{x} \in \mathbb{R}^{n}$
- **Hidden layers**: $\mathbf{h}^{(l)} = f(\mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)})$
- **Output layer**: $\mathbf{y} = g(\mathbf{W}^{(L)}\mathbf{h}^{(L-1)} + \mathbf{b}^{(L)})$

**Forward Propagation:**
$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$
$$\mathbf{a}^{(l)} = f(\mathbf{z}^{(l)})$$
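
As a minimal sketch of the two equations above, the following applies them by hand to a hypothetical 2-3-1 network. The weight values, the `tanh` hidden activation, and the names `W1`, `b1`, `W2`, `b2` are illustrative assumptions and are not taken from the implementation in this notebook.

```python
import numpy as np

# Hypothetical 2-3-1 network with hand-picked parameters (illustration only)
W1 = np.array([[0.5, -0.2],
               [0.1,  0.4],
               [-0.3, 0.8]])       # shape (3, 2): 3 hidden units, 2 inputs
b1 = np.zeros((3, 1))
W2 = np.array([[1.0, -1.0, 0.5]])  # shape (1, 3): 1 output unit
b2 = np.zeros((1, 1))

a0 = np.array([[1.0], [2.0]])      # one input sample as a column vector

z1 = W1 @ a0 + b1                  # z^(1) = W^(1) a^(0) + b^(1)
a1 = np.tanh(z1)                   # a^(1) = f(z^(1))
z2 = W2 @ a1 + b2                  # z^(2) = W^(2) a^(1) + b^(2)
a2 = 1 / (1 + np.exp(-z2))         # sigmoid output, a value in (0, 1)

print("z1 =", z1.ravel())
print("output probability =", a2.item())
```

Each weight matrix has shape (units in this layer, units in the previous layer), which is why samples are stored as columns.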

**Activation Functions:**

**Sigmoid**: $\sigma(z) = \frac{1}{1 + e^{-z}}$, $\sigma'(z) = \sigma(z)(1-\sigma(z))$

**Tanh**: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$, $\tanh'(z) = 1 - \tanh^2(z)$

**ReLU**: $\text{ReLU}(z) = \max(0, z)$, $\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}$

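**Leaky ReLU**: $\text{LeakyReLU}(z) = \max(\alpha z, z)$ with $0 < \alpha < 1$ (here $\alpha = 0.01$), $\text{LeakyReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{otherwise} \end{cases}$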
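
The derivative formulas above explain why saturating activations slow learning in deep networks. A rough sketch, assuming the activation derivatives are the only factors in the backpropagated product (weights are ignored here, which is a simplification): the sigmoid derivative never exceeds 0.25, so ten stacked sigmoid layers scale the gradient by at most $0.25^{10}$.

```python
import numpy as np

z_grid = np.linspace(-5, 5, 1001)

sig = 1 / (1 + np.exp(-z_grid))
sig_grad = sig * (1 - sig)              # sigma'(z) = sigma(z)(1 - sigma(z))
tanh_grad = 1 - np.tanh(z_grid) ** 2    # tanh'(z) = 1 - tanh^2(z)

max_sig_grad = sig_grad.max()           # attained at z = 0
max_tanh_grad = tanh_grad.max()         # attained at z = 0

# Upper bound on the product of sigmoid derivatives across 10 layers
n_layers = 10
sigmoid_bound = max_sig_grad ** n_layers

print(f"max sigmoid derivative: {max_sig_grad:.4f}")
print(f"max tanh derivative:    {max_tanh_grad:.4f}")
print(f"bound over {n_layers} sigmoid layers: {sigmoid_bound:.2e}")
```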

**Universal Approximation Theorem:**
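A feedforward network with a single hidden layer, a finite number of neurons, and a non-polynomial activation function (such as sigmoid or ReLU) can approximate any continuous function on a compact subset of $\mathbb{R}^n$ to arbitrary accuracy. The theorem guarantees that such a network exists; it does not say how many neurons are needed or that gradient descent will find the right weights.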

In [None]:
class ActivationFunction:
    """Base class for activation functions."""
    
    def forward(self, z):
        raise NotImplementedError
    
    def backward(self, z):
        raise NotImplementedError

class Sigmoid(ActivationFunction):
    def forward(self, z):
        # Clip to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    def backward(self, z):
        s = self.forward(z)
        return s * (1 - s)

class Tanh(ActivationFunction):
    def forward(self, z):
        return np.tanh(z)
    
    def backward(self, z):
        return 1 - np.tanh(z) ** 2

class ReLU(ActivationFunction):
    def forward(self, z):
        return np.maximum(0, z)
    
    def backward(self, z):
        return (z > 0).astype(float)

class LeakyReLU(ActivationFunction):
    def __init__(self, alpha=0.01):
        self.alpha = alpha
    
    def forward(self, z):
        return np.where(z > 0, z, self.alpha * z)
    
    def backward(self, z):
        return np.where(z > 0, 1, self.alpha)

class MLPCustom:
    """Multi-Layer Perceptron implementation from scratch."""
    
    def __init__(self, hidden_sizes, activation='relu', output_activation='sigmoid', 
                 learning_rate=0.01, max_iter=1000, random_state=None):
        self.hidden_sizes = hidden_sizes
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.random_state = random_state
        
        # Set activation functions
        activations = {
            'sigmoid': Sigmoid(),
            'tanh': Tanh(),
            'relu': ReLU(),
            'leaky_relu': LeakyReLU()
        }
        
        self.activation = activations[activation]
        self.output_activation = activations[output_activation]
        
        self.weights = []
        self.biases = []
        self.loss_history = []
        
    def _initialize_parameters(self, input_size, output_size):
        """Initialize weights and biases using Xavier initialization."""
        if self.random_state is not None:
            np.random.seed(self.random_state)
        
        # Build layer sizes
        layer_sizes = [input_size] + self.hidden_sizes + [output_size]
        
        self.weights = []
        self.biases = []
        
        for i in range(len(layer_sizes) - 1):
            # Xavier initialization
            limit = np.sqrt(6 / (layer_sizes[i] + layer_sizes[i + 1]))
            W = np.random.uniform(-limit, limit, (layer_sizes[i + 1], layer_sizes[i]))
            b = np.zeros((layer_sizes[i + 1], 1))
            
            self.weights.append(W)
            self.biases.append(b)
    
    def _forward_propagation(self, X):
        """Forward propagation through the network."""
        activations = [X.T]  # Transpose for column vectors
        z_values = []
        
        for i in range(len(self.weights)):
            # Linear transformation
            z = self.weights[i] @ activations[-1] + self.biases[i]
            z_values.append(z)
            
            # Apply activation function
            if i == len(self.weights) - 1:  # Output layer
                a = self.output_activation.forward(z)
            else:  # Hidden layers
                a = self.activation.forward(z)
            
            activations.append(a)
        
        return activations, z_values
    
    def _compute_cost(self, y_true, y_pred):
        """Compute binary cross-entropy loss."""
        m = y_true.shape[1]
        
        # Clip predictions to prevent log(0)
        y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
        
        cost = -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)) / m
        return cost
    
    def _backward_propagation(self, X, y, activations, z_values):
        """Backward propagation to compute gradients."""
        m = X.shape[0]
        n_layers = len(self.weights)
        
        # Initialize gradients
        dW = [np.zeros_like(w) for w in self.weights]
        db = [np.zeros_like(b) for b in self.biases]
        
        # Output layer gradient
        dz = activations[-1] - y.T
        
        # Backpropagate through all layers
        for i in reversed(range(n_layers)):
            # Gradients for weights and biases
            dW[i] = dz @ activations[i].T / m
            db[i] = np.sum(dz, axis=1, keepdims=True) / m
            
            # Gradient for previous layer (if not input layer)
            if i > 0:
                da_prev = self.weights[i].T @ dz
                
                # Apply derivative of the hidden activation (same rule for every hidden layer)
                dz = da_prev * self.activation.backward(z_values[i-1])
        
        return dW, db
    
    def fit(self, X, y):
        """Train the neural network."""
        X = np.array(X)
        y = np.array(y).reshape(-1, 1)
        
        # Initialize parameters
        self._initialize_parameters(X.shape[1], 1)
        
        self.loss_history = []
        
        for epoch in range(self.max_iter):
            # Forward propagation
            activations, z_values = self._forward_propagation(X)
            
            # Compute cost
            cost = self._compute_cost(y.T, activations[-1])
            self.loss_history.append(cost)
            
            # Backward propagation
            dW, db = self._backward_propagation(X, y, activations, z_values)
            
            # Update parameters
            for i in range(len(self.weights)):
                self.weights[i] -= self.learning_rate * dW[i]
                self.biases[i] -= self.learning_rate * db[i]
        
        return self
    
    def predict_proba(self, X):
        """Predict class probabilities."""
        activations, _ = self._forward_propagation(X)
        return activations[-1].T
    
    def predict(self, X):
        """Make binary predictions."""
        probabilities = self.predict_proba(X)
        return (probabilities >= 0.5).astype(int).flatten()

# Generate datasets for testing
# Linearly separable data
X_linear, y_linear = make_classification(n_samples=500, n_features=2, n_redundant=0, 
                                        n_informative=2, n_clusters_per_class=1, 
                                        class_sep=2.0, random_state=42)

# Non-linear data (moons)
X_moons, y_moons = make_moons(n_samples=500, noise=0.2, random_state=42)

# Non-linear data (circles)
X_circles, y_circles = make_circles(n_samples=500, noise=0.1, factor=0.3, random_state=42)

datasets = {
    'Linear': (X_linear, y_linear),
    'Moons': (X_moons, y_moons),
    'Circles': (X_circles, y_circles)
}

# Test different activation functions
activation_functions = ['sigmoid', 'tanh', 'relu', 'leaky_relu']
results = {}

for dataset_name, (X, y) in datasets.items():
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Standardize features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    dataset_results = {}
    
    for activation in activation_functions:
        # Custom MLP
        mlp_custom = MLPCustom(
            hidden_sizes=[10, 5],
            activation=activation,
            learning_rate=0.1,
            max_iter=1000,
            random_state=42
        )
        
        mlp_custom.fit(X_train_scaled, y_train)
        y_pred_custom = mlp_custom.predict(X_test_scaled)
        accuracy_custom = accuracy_score(y_test, y_pred_custom)
        
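        # Sklearn MLP for comparison. sklearn names the sigmoid 'logistic' and
        # has no leaky ReLU, so plain ReLU stands in as the closest option.
        sklearn_activation = {'sigmoid': 'logistic', 'leaky_relu': 'relu'}.get(activation, activation)
        mlp_sklearn = MLPClassifier(
            hidden_layer_sizes=(10, 5),
            activation=sklearn_activation,
            learning_rate_init=0.1,
            max_iter=1000,
            random_state=42
        )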
        
        mlp_sklearn.fit(X_train_scaled, y_train)
        y_pred_sklearn = mlp_sklearn.predict(X_test_scaled)
        accuracy_sklearn = accuracy_score(y_test, y_pred_sklearn)
        
        dataset_results[activation] = {
            'custom_accuracy': accuracy_custom,
            'sklearn_accuracy': accuracy_sklearn,
            'final_loss': mlp_custom.loss_history[-1] if mlp_custom.loss_history else np.inf
        }
    
    results[dataset_name] = dataset_results

# Print results
for dataset_name, dataset_results in results.items():
    print(f"\n{dataset_name} Dataset Results:")
    df = pd.DataFrame(dataset_results).T
    print(df.round(4))

In [None]:
# Visualize activation functions and their derivatives
fig, axes = plt.subplots(2, 4, figsize=(16, 8))

z = np.linspace(-5, 5, 1000)
activations = {
    'Sigmoid': Sigmoid(),
    'Tanh': Tanh(),
    'ReLU': ReLU(),
    'Leaky ReLU': LeakyReLU()
}

for i, (name, activation) in enumerate(activations.items()):
    # Activation function
    y = activation.forward(z)
    axes[0, i].plot(z, y, 'b-', linewidth=2, label=f'{name}')
    axes[0, i].set_title(f'{name} Activation')
    axes[0, i].set_xlabel('z')
    axes[0, i].set_ylabel('f(z)')
    axes[0, i].grid(True, alpha=0.3)
    axes[0, i].legend()
    
    # Derivative
    dy = activation.backward(z)
    axes[1, i].plot(z, dy, 'r-', linewidth=2, label=f"{name}' ")
    axes[1, i].set_title(f'{name} Derivative')
    axes[1, i].set_xlabel('z')
    axes[1, i].set_ylabel("f'(z)")
    axes[1, i].grid(True, alpha=0.3)
    axes[1, i].legend()

plt.tight_layout()
plt.show()

# Analyze gradient flow issues
print("\nGradient Flow Analysis:")
print("Sigmoid: Prone to vanishing gradients (max derivative = 0.25)")
print("Tanh: Zero-centered with max derivative = 1.0, but still saturates for large |z|")
print("ReLU: Mitigates vanishing gradients for z > 0 but can suffer from dying ReLU units")
print("Leaky ReLU: Addresses dying ReLU with a small negative slope")

## Question 2: Backpropagation Algorithm Implementation

**Question:** Implement the backpropagation algorithm step by step and verify gradients using numerical differentiation. Analyze how gradient computation scales with network depth.

### Theory

**Backpropagation Algorithm:**

**Chain Rule Application:**
$$\frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial z^{(l)}} \frac{\partial z^{(l)}}{\partial W^{(l)}}$$

**Layer-wise Gradient Computation:**
1. **Output layer error**: $\delta^{(L)} = \frac{\partial L}{\partial z^{(L)}} = \frac{\partial L}{\partial a^{(L)}} \odot f'(z^{(L)})$

2. **Hidden layer error**: $\delta^{(l)} = ((W^{(l+1)})^T \delta^{(l+1)}) \odot f'(z^{(l)})$

3. **Weight gradients**: $\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T$

4. **Bias gradients**: $\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$
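
The implementation in this notebook uses `activations[-1] - y.T` as the output-layer error. This follows from step 1 when the output activation is a sigmoid and the loss is binary cross-entropy. A short derivation for one example, writing $a = \sigma(z^{(L)})$:

$$\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a} = \frac{a - y}{a(1-a)}, \qquad \sigma'(z^{(L)}) = a(1-a)$$
$$\delta^{(L)} = \frac{a - y}{a(1-a)} \cdot a(1-a) = a - y$$

Steps 3 and 4 are written for a single example. For a batch of $m$ examples stored as columns, the gradients are averaged: $\frac{\partial L}{\partial W^{(l)}} = \frac{1}{m}\delta^{(l)}(a^{(l-1)})^T$ and $\frac{\partial L}{\partial b^{(l)}} = \frac{1}{m}\sum_{i=1}^m \delta^{(l)}_{:,i}$.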

**Numerical Gradient Checking:**
$$\frac{\partial L}{\partial \theta} \approx \frac{L(\theta + \epsilon) - L(\theta - \epsilon)}{2\epsilon}$$
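
Before applying the central-difference formula above to a full network, it can help to see it on a function whose gradient is known in closed form. The sketch below uses a hypothetical test loss, `loss_fn`, chosen only for illustration; the helper names and the step size `eps=1e-6` are assumptions as well.

```python
import numpy as np

def loss_fn(theta):
    """Hypothetical smooth test loss: sum(theta^2) + sin(theta[0])."""
    return np.sum(theta ** 2) + np.sin(theta[0])

def analytic_grad(theta):
    """Closed-form gradient of loss_fn."""
    grad = 2 * theta
    grad[0] += np.cos(theta[0])
    return grad

def central_difference_grad(f, theta, eps=1e-6):
    """Approximate each partial derivative with (f(t + eps) - f(t - eps)) / (2 eps)."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        grad[k] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

theta0 = np.array([0.3, -1.2, 2.0])
g_true = analytic_grad(theta0)
g_num = central_difference_grad(loss_fn, theta0)

# Relative error; values far below 1e-6 indicate the two gradients agree
rel_err = np.linalg.norm(g_true - g_num) / (np.linalg.norm(g_true) + np.linalg.norm(g_num))
print(f"relative error: {rel_err:.2e}")
```

The central difference has truncation error of order $\epsilon^2$, compared with order $\epsilon$ for a one-sided difference, which is why it is the usual choice for gradient checking.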

**Computational Complexity:**
- Forward pass: $O(W)$ per training example, where $W$ is the total number of weights
- Backward pass: $O(W)$ per training example (the same order as the forward pass)
- Memory: $O(W + A)$, where $A$ is the total number of activations stored for the backward pass
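
To make $W$ and $A$ concrete, the counts can be read directly off a list of layer sizes. This is a small sketch; `count_parameters` is a hypothetical helper, and the architecture `[3, 10, 8, 5, 1]` is just an example of a fully connected network.

```python
def count_parameters(layer_sizes):
    """Return (number of weights, number of biases) for a fully connected network."""
    n_weights = sum(n_in * n_out
                    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    n_biases = sum(layer_sizes[1:])
    return n_weights, n_biases

sizes = [3, 10, 8, 5, 1]
n_w, n_b = count_parameters(sizes)
n_act = sum(sizes)  # activations stored per example, including the input

print(f"weights: {n_w}, biases: {n_b}, activations per example: {n_act}")
```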

In [None]:
class BackpropagationAnalyzer:
    """Detailed backpropagation implementation with gradient checking."""
    
    def __init__(self, layer_sizes, activation='tanh'):
        self.layer_sizes = layer_sizes
        self.activation = Tanh() if activation == 'tanh' else ReLU()
        self.output_activation = Sigmoid()
        
        self.weights = []
        self.biases = []
        self.initialize_parameters()
        
        # For analysis
        self.gradient_norms = []
        self.activation_stats = []
        
    def initialize_parameters(self):
        """Initialize parameters with small random values."""
        np.random.seed(42)
        
        for i in range(len(self.layer_sizes) - 1):
            # He initialization for ReLU, Xavier for tanh
            if isinstance(self.activation, ReLU):
                std = np.sqrt(2.0 / self.layer_sizes[i])
            else:
                std = np.sqrt(1.0 / self.layer_sizes[i])
            
            W = np.random.normal(0, std, (self.layer_sizes[i+1], self.layer_sizes[i]))
            b = np.zeros((self.layer_sizes[i+1], 1))
            
            self.weights.append(W)
            self.biases.append(b)
    
    def forward_propagation_detailed(self, X):
        """Forward propagation with detailed tracking."""
        activations = [X.T]
        z_values = []
        
        for i in range(len(self.weights)):
            # Linear transformation
            z = self.weights[i] @ activations[-1] + self.biases[i]
            z_values.append(z)
            
            # Activation
            if i == len(self.weights) - 1:  # Output layer
                a = self.output_activation.forward(z)
            else:
                a = self.activation.forward(z)
            
            activations.append(a)
        
        return activations, z_values
    
    def backward_propagation_detailed(self, X, y, activations, z_values):
        """Detailed backward propagation with analysis."""
        m = X.shape[0]
        n_layers = len(self.weights)
        
        # Initialize gradients
        dW = [np.zeros_like(w) for w in self.weights]
        db = [np.zeros_like(b) for b in self.biases]
        deltas = []
        
        # Output layer gradient (for binary cross-entropy)
        delta = activations[-1] - y.T
        deltas.append(delta)
        
        # Backpropagate through all layers
        for i in reversed(range(n_layers)):
            # Gradients for current layer
            dW[i] = delta @ activations[i].T / m
            db[i] = np.sum(delta, axis=1, keepdims=True) / m
            
            # Propagate error to previous layer
            if i > 0:
                # Error propagation
                delta_prev = self.weights[i].T @ delta
                
                # Apply activation derivative
                activation_derivative = self.activation.backward(z_values[i-1])
                delta = delta_prev * activation_derivative
                
                deltas.append(delta)
        
        # Store gradient norms for analysis
        layer_grad_norms = []
        for i, dw in enumerate(dW):
            grad_norm = np.linalg.norm(dw)
            layer_grad_norms.append(grad_norm)
        
        self.gradient_norms.append(layer_grad_norms)
        
        return dW, db, deltas[::-1]  # Reverse to match layer order
    
    def numerical_gradient(self, X, y, epsilon=1e-7):
        """Compute numerical gradients for verification."""
        numerical_dW = []
        numerical_db = []
        
        for i in range(len(self.weights)):
            # Numerical gradient for weights
            dW_num = np.zeros_like(self.weights[i])
            
            for row in range(self.weights[i].shape[0]):
                for col in range(self.weights[i].shape[1]):
                    # Forward perturbation
                    self.weights[i][row, col] += epsilon
                    activations_plus, _ = self.forward_propagation_detailed(X)
                    loss_plus = self.compute_loss(y, activations_plus[-1])
                    
                    # Backward perturbation
                    self.weights[i][row, col] -= 2 * epsilon
                    activations_minus, _ = self.forward_propagation_detailed(X)
                    loss_minus = self.compute_loss(y, activations_minus[-1])
                    
                    # Numerical gradient
                    dW_num[row, col] = (loss_plus - loss_minus) / (2 * epsilon)
                    
                    # Restore original value
                    self.weights[i][row, col] += epsilon
            
            numerical_dW.append(dW_num)
            
            # Numerical gradient for biases (simplified - just first few)
            db_num = np.zeros_like(self.biases[i])
            for row in range(min(3, self.biases[i].shape[0])):  # Check first 3 biases only
                # Forward perturbation
                self.biases[i][row, 0] += epsilon
                activations_plus, _ = self.forward_propagation_detailed(X)
                loss_plus = self.compute_loss(y, activations_plus[-1])
                
                # Backward perturbation
                self.biases[i][row, 0] -= 2 * epsilon
                activations_minus, _ = self.forward_propagation_detailed(X)
                loss_minus = self.compute_loss(y, activations_minus[-1])
                
                # Numerical gradient
                db_num[row, 0] = (loss_plus - loss_minus) / (2 * epsilon)
                
                # Restore original value
                self.biases[i][row, 0] += epsilon
            
            numerical_db.append(db_num)
        
        return numerical_dW, numerical_db
    
    def compute_loss(self, y_true, y_pred):
        """Compute binary cross-entropy loss."""
        m = y_true.shape[0]
        y_pred = np.clip(y_pred.T, 1e-15, 1 - 1e-15)
        y_true = y_true.reshape(-1, 1)
        
        loss = -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)) / m
        return loss
    
    def gradient_check(self, X, y, tolerance=1e-5):
        """Verify gradients using numerical differentiation."""
        # Compute analytical gradients
        activations, z_values = self.forward_propagation_detailed(X)
        dW_analytical, db_analytical, _ = self.backward_propagation_detailed(X, y, activations, z_values)
        
        # Compute numerical gradients (all weights; first 3 biases per layer)
        dW_numerical, db_numerical = self.numerical_gradient(X, y)
        
        # Compare gradients
        gradient_errors = []
        
        for i in range(len(dW_analytical)):
            # Weight gradients
            if dW_numerical[i].size > 0:
                diff_W = np.abs(dW_analytical[i] - dW_numerical[i])
                relative_error_W = np.max(diff_W / (np.abs(dW_analytical[i]) + np.abs(dW_numerical[i]) + 1e-10))
                gradient_errors.append(('Weight', i, relative_error_W))
            
            # Bias gradients (first few only)
            if db_numerical[i].size > 0:
                n_check = min(3, db_analytical[i].shape[0])
                diff_b = np.abs(db_analytical[i][:n_check] - db_numerical[i][:n_check])
                relative_error_b = np.max(diff_b / (np.abs(db_analytical[i][:n_check]) + np.abs(db_numerical[i][:n_check]) + 1e-10))
                gradient_errors.append(('Bias', i, relative_error_b))
        
        return gradient_errors

# Test gradient implementation
print("Testing Backpropagation Implementation:")
print("=" * 50)

# Create small test dataset
X_test = np.random.randn(20, 3)
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

# Test different network architectures
architectures = {
    'Shallow': [3, 5, 1],
    'Deep': [3, 10, 8, 5, 1],
    'Very Deep': [3, 12, 10, 8, 6, 4, 1]
}

for arch_name, layer_sizes in architectures.items():
    print(f"\n{arch_name} Network ({len(layer_sizes)-1} layers):")
    
    # Create analyzer
    analyzer = BackpropagationAnalyzer(layer_sizes, activation='tanh')
    
    # Check gradients
    gradient_errors = analyzer.gradient_check(X_test, y_test)
    
    # Print results
    max_error = max([error for _, _, error in gradient_errors])
    print(f"Maximum gradient error: {max_error:.2e}")
    
    if max_error < 1e-4:
        print("✓ Gradients are correct")
    else:
        print("✗ Gradient computation may have errors")
        
    # Analyze gradient flow
    activations, z_values = analyzer.forward_propagation_detailed(X_test)
    dW, db, deltas = analyzer.backward_propagation_detailed(X_test, y_test, activations, z_values)
    
    print(f"Gradient norms by layer: {[f'{norm:.2e}' for norm in analyzer.gradient_norms[-1]]}")

In [None]:
# Visualize gradient flow and vanishing gradient problem
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Test vanishing gradients with deep networks
depths = range(2, 11)
gradient_ratios_sigmoid = []
gradient_ratios_relu = []

for depth in depths:
    # Create deep network with sigmoid hidden units. The constructor only
    # distinguishes 'relu' from everything else, so build with 'tanh'
    # (Xavier-style init) and then swap in the sigmoid activation.
    layer_sizes_sigmoid = [3] + [10] * (depth - 1) + [1]
    analyzer_sigmoid = BackpropagationAnalyzer(layer_sizes_sigmoid, activation='tanh')
    analyzer_sigmoid.activation = Sigmoid()
    
    # Forward and backward pass
    activations, z_values = analyzer_sigmoid.forward_propagation_detailed(X_test)
    dW, db, deltas = analyzer_sigmoid.backward_propagation_detailed(X_test, y_test, activations, z_values)
    
    # Calculate gradient ratio (first layer / last layer)
    first_layer_grad = np.linalg.norm(dW[0])
    last_layer_grad = np.linalg.norm(dW[-1])
    ratio_sigmoid = first_layer_grad / (last_layer_grad + 1e-10)
    gradient_ratios_sigmoid.append(ratio_sigmoid)
    
    # Create deep network with ReLU
    analyzer_relu = BackpropagationAnalyzer(layer_sizes_sigmoid, activation='relu')
    
    # Forward and backward pass
    activations, z_values = analyzer_relu.forward_propagation_detailed(X_test)
    dW, db, deltas = analyzer_relu.backward_propagation_detailed(X_test, y_test, activations, z_values)
    
    # Calculate gradient ratio
    first_layer_grad = np.linalg.norm(dW[0])
    last_layer_grad = np.linalg.norm(dW[-1])
    ratio_relu = first_layer_grad / (last_layer_grad + 1e-10)
    gradient_ratios_relu.append(ratio_relu)

# Plot gradient ratios
axes[0, 0].semilogy(depths, gradient_ratios_sigmoid, 'o-', label='Sigmoid', linewidth=2, markersize=6)
axes[0, 0].semilogy(depths, gradient_ratios_relu, 's-', label='ReLU', linewidth=2, markersize=6)
axes[0, 0].set_xlabel('Network Depth')
axes[0, 0].set_ylabel('Gradient Ratio (First/Last Layer)')
axes[0, 0].set_title('Vanishing Gradient Problem')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Gradient flow visualization for specific network
deep_network = BackpropagationAnalyzer([3, 12, 10, 8, 6, 4, 1], activation='tanh')
activations, z_values = deep_network.forward_propagation_detailed(X_test)
dW, db, deltas = deep_network.backward_propagation_detailed(X_test, y_test, activations, z_values)

layer_indices = range(len(dW))
gradient_norms = [np.linalg.norm(dw) for dw in dW]

axes[0, 1].bar(layer_indices, gradient_norms, alpha=0.7, color='purple')
axes[0, 1].set_xlabel('Layer Index')
axes[0, 1].set_ylabel('Gradient Norm')
axes[0, 1].set_title('Gradient Magnitudes by Layer')
axes[0, 1].grid(True, alpha=0.3)

# Activation statistics
activation_means = [np.mean(np.abs(a)) for a in activations[1:-1]]  # Exclude input and output
activation_stds = [np.std(a) for a in activations[1:-1]]

hidden_layer_indices = range(len(activation_means))
axes[1, 0].bar(hidden_layer_indices, activation_means, alpha=0.7, color='green', label='Mean |activation|')
axes[1, 0].set_xlabel('Hidden Layer Index')
axes[1, 0].set_ylabel('Mean Absolute Activation')
axes[1, 0].set_title('Activation Statistics by Layer')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Computational complexity analysis
network_sizes = [(3, 10, 1), (3, 20, 10, 1), (3, 30, 20, 10, 1), (3, 40, 30, 20, 10, 1)]
forward_times = []
backward_times = []
total_parameters = []

import time

for sizes in network_sizes:
    analyzer = BackpropagationAnalyzer(list(sizes))
    
    # Count parameters
    n_params = sum(w.size + b.size for w, b in zip(analyzer.weights, analyzer.biases))
    total_parameters.append(n_params)
    
    # Time forward pass (perf_counter is a monotonic, high-resolution clock)
    start_time = time.perf_counter()
    for _ in range(100):
        activations, z_values = analyzer.forward_propagation_detailed(X_test)
    forward_time = (time.perf_counter() - start_time) / 100
    forward_times.append(forward_time)
    
    # Time backward pass
    start_time = time.perf_counter()
    for _ in range(100):
        dW, db, deltas = analyzer.backward_propagation_detailed(X_test, y_test, activations, z_values)
    backward_time = (time.perf_counter() - start_time) / 100
    backward_times.append(backward_time)

axes[1, 1].plot(total_parameters, forward_times, 'o-', label='Forward Pass', linewidth=2, markersize=6)
axes[1, 1].plot(total_parameters, backward_times, 's-', label='Backward Pass', linewidth=2, markersize=6)
axes[1, 1].set_xlabel('Number of Parameters')
axes[1, 1].set_ylabel('Time per Pass (seconds)')
axes[1, 1].set_title('Computational Complexity')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nGradient Flow Analysis:")
print("Sigmoid networks tend to show gradient ratios that shrink roughly exponentially with depth")
print("ReLU networks with He initialization usually keep gradient ratios more stable")
print("Backward pass cost is on the same order as the forward pass")
print("In theory both scale linearly with the number of parameters; for networks this small, fixed overhead can dominate the timings")

## Question 3: Loss Functions and Optimization Techniques

**Question:** Compare different loss functions for neural networks and implement various optimization algorithms. Analyze their convergence properties and suitability for different problems.

### Theory

**Loss Functions:**

**Binary Cross-Entropy**: $L = -\frac{1}{m}\sum_{i=1}^m [y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$

**Categorical Cross-Entropy**: $L = -\frac{1}{m}\sum_{i=1}^m \sum_{c=1}^C y_{ic} \log(\hat{y}_{ic})$

**Mean Squared Error**: $L = \frac{1}{m}\sum_{i=1}^m (y_i - \hat{y}_i)^2$

**Huber Loss**: $L_{\delta} = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta|y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$
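
The practical difference between the squared error and the Huber loss shows up when a few predictions are far off. The sketch below compares the two on hypothetical values where the last prediction is an outlier; the data and $\delta = 1$ are chosen only for illustration.

```python
import numpy as np

y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 14.0])   # last prediction is far off
delta = 1.0

err = y_obs - y_hat

# Per-example squared error
sq_terms = err ** 2

# Per-example Huber loss: quadratic for |err| <= delta, linear beyond it
huber_terms = np.where(np.abs(err) <= delta,
                       0.5 * err ** 2,
                       delta * np.abs(err) - 0.5 * delta ** 2)

mse_value = sq_terms.mean()
huber_value = huber_terms.mean()

print("squared-error terms:", sq_terms)
print("Huber terms:        ", huber_terms)
print(f"MSE = {mse_value:.4f}, Huber = {huber_value:.4f}")
```

The outlier contributes 100 to the squared error but only 9.5 to the Huber loss, so a single bad point has much less influence on the Huber gradient.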

**Optimization Algorithms:**

**SGD with Momentum**: 
$$v_t = \beta v_{t-1} + (1-\beta)\nabla L(\theta_t)$$
$$\theta_{t+1} = \theta_t - \alpha v_t$$
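
The momentum update above is the exponential-moving-average form: the $(1-\beta)$ factor keeps $v_t$ on the same scale as the gradient. A minimal sketch on a one-dimensional quadratic, assuming $f(\theta) = \frac{1}{2}\theta^2$ so that the gradient is simply $\theta$; the function name `run_gd` and the hyperparameter values are illustrative.

```python
import numpy as np

def run_gd(beta, lr=0.1, n_steps=50, theta_start=5.0):
    """Minimize f(theta) = 0.5 * theta^2 with the EMA-style momentum update."""
    theta, v = theta_start, 0.0
    path = [theta]
    for _ in range(n_steps):
        grad = theta                        # gradient of 0.5 * theta^2
        v = beta * v + (1 - beta) * grad    # v_t = beta v_{t-1} + (1 - beta) grad
        theta = theta - lr * v              # theta_{t+1} = theta_t - alpha v_t
        path.append(theta)
    return np.array(path)

plain = run_gd(beta=0.0)      # beta = 0 reduces to plain gradient descent
smoothed = run_gd(beta=0.9)   # velocity averages over recent gradients

print(f"final theta, beta=0.0: {plain[-1]:.4f}")
print(f"final theta, beta=0.9: {smoothed[-1]:.4f}")
```

With $\beta = 0$ each step multiplies $\theta$ by $1 - \alpha$, so the path has a closed form. Whether momentum speeds up convergence depends on the problem and the hyperparameters, so this sketch should not be read as a benchmark.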

**Adam**:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla L(\theta_t)$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla L(\theta_t))^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\hat{m}_t$$
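
One consequence of the bias-corrected updates above is that the first Adam step has magnitude close to $\alpha$ regardless of the gradient's scale, because $\hat{m}_1 = g$ and $\hat{v}_1 = g^2$. The sketch below checks this for a single step; `adam_first_step` is a hypothetical helper and the gradient values are arbitrary.

```python
import numpy as np

def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the parameter change at t = 1, starting from m_0 = v_0 = 0."""
    m = (1 - beta1) * grad            # m_1
    v = (1 - beta2) * grad ** 2       # v_1
    m_hat = m / (1 - beta1 ** 1)      # bias correction gives m_hat = grad
    v_hat = v / (1 - beta2 ** 1)      # bias correction gives v_hat = grad^2
    return lr * m_hat / (np.sqrt(v_hat) + eps)

grads = np.array([1e-3, 1.0, 1e3])    # gradients spanning six orders of magnitude
steps = adam_first_step(grads)

print("first-step sizes:", steps)
```

Without the bias correction, the first step would be scaled by $(1-\beta_1)/\sqrt{1-\beta_2}$, roughly 3.2 with the default values.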

In [None]:
class LossFunction:
    """Base class for loss functions."""
    
    def forward(self, y_true, y_pred):
        raise NotImplementedError
    
    def backward(self, y_true, y_pred):
        raise NotImplementedError

class BinaryCrossEntropy(LossFunction):
    def forward(self, y_true, y_pred):
        y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    
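    def backward(self, y_true, y_pred):
        # Gradient of the mean loss w.r.t. y_pred. len(y_true) equals the number
        # of samples only when samples lie along axis 0 (shape (m,) or (m, 1)).
        y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
        return (y_pred - y_true) / (y_pred * (1 - y_pred)) / len(y_true)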

class MeanSquaredError(LossFunction):
    def forward(self, y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)
    
    def backward(self, y_true, y_pred):
        return 2 * (y_pred - y_true) / len(y_true)

class HuberLoss(LossFunction):
    def __init__(self, delta=1.0):
        self.delta = delta
    
    def forward(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = np.abs(error) <= self.delta
        
        squared_loss = 0.5 * error ** 2
        linear_loss = self.delta * np.abs(error) - 0.5 * self.delta ** 2
        
        return np.mean(np.where(is_small_error, squared_loss, linear_loss))
    
    def backward(self, y_true, y_pred):
        error = y_pred - y_true
        is_small_error = np.abs(error) <= self.delta
        
        return np.where(is_small_error, error, self.delta * np.sign(error)) / len(y_true)

class Optimizer:
    """Base class for optimizers."""
    
    def update(self, params, gradients):
        raise NotImplementedError

class SGD(Optimizer):
    def __init__(self, learning_rate=0.01, momentum=0.0):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = None
    
    def update(self, params, gradients):
        if self.velocity is None:
            self.velocity = [np.zeros_like(p) for p in params]
        
        updated_params = []
        for i, (param, grad) in enumerate(zip(params, gradients)):
            self.velocity[i] = self.momentum * self.velocity[i] + (1 - self.momentum) * grad
            updated_param = param - self.learning_rate * self.velocity[i]
            updated_params.append(updated_param)
        
        return updated_params

class Adam(Optimizer):
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        
        self.m = None  # First moment estimate
        self.v = None  # Second moment estimate
        self.t = 0     # Time step
    
    def update(self, params, gradients):
        if self.m is None:
            self.m = [np.zeros_like(p) for p in params]
            self.v = [np.zeros_like(p) for p in params]
        
        self.t += 1
        updated_params = []
        
        for i, (param, grad) in enumerate(zip(params, gradients)):
            # Update biased first and second moment estimates
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * grad
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * grad ** 2
            
            # Compute bias-corrected estimates
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            
            # Update parameters
            updated_param = param - self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
            updated_params.append(updated_param)
        
        return updated_params

class NeuralNetworkTrainer:
    """Neural network trainer with different loss functions and optimizers."""
    
    def __init__(self, layer_sizes, loss_function, optimizer, activation='relu'):
        self.layer_sizes = layer_sizes
        self.loss_function = loss_function
        self.optimizer = optimizer
        
        # Set activation functions
        activations = {
            'sigmoid': Sigmoid(),
            'tanh': Tanh(),
            'relu': ReLU()
        }
        self.activation = activations[activation]
        
        self.weights = []
        self.biases = []
        self.loss_history = []
        
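    def initialize_parameters(self):
        """Initialize parameters, discarding any previously stored weights."""
        np.random.seed(42)
        # Reset the lists; otherwise a second train() call would append extra layers
        self.weights = []
        self.biases = []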
        
        for i in range(len(self.layer_sizes) - 1):
            # He initialization for ReLU
            if isinstance(self.activation, ReLU):
                std = np.sqrt(2.0 / self.layer_sizes[i])
            else:
                std = np.sqrt(1.0 / self.layer_sizes[i])
            
            W = np.random.normal(0, std, (self.layer_sizes[i+1], self.layer_sizes[i]))
            b = np.zeros((self.layer_sizes[i+1], 1))
            
            self.weights.append(W)
            self.biases.append(b)
    
    def forward_propagation(self, X):
        """Forward propagation."""
        activations = [X.T]
        z_values = []
        
        for i in range(len(self.weights)):
            z = self.weights[i] @ activations[-1] + self.biases[i]
            z_values.append(z)
            
            if i == len(self.weights) - 1:  # Output layer
                a = z  # Linear output for regression, or add sigmoid for classification
            else:
                a = self.activation.forward(z)
            
            activations.append(a)
        
        return activations, z_values
    
    def backward_propagation(self, X, y, activations, z_values):
        """Backward propagation."""
        m = X.shape[0]
        n_layers = len(self.weights)
        
        dW = [np.zeros_like(w) for w in self.weights]
        db = [np.zeros_like(b) for b in self.biases]
        
        # Output layer gradient
        dz = self.loss_function.backward(y.T, activations[-1])
        
        for i in reversed(range(n_layers)):
            dW[i] = dz @ activations[i].T
            db[i] = np.sum(dz, axis=1, keepdims=True)
            
            if i > 0:
                da_prev = self.weights[i].T @ dz
                dz = da_prev * self.activation.backward(z_values[i-1])
        
        return dW, db
    
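    def train(self, X, y, epochs=1000, verbose=False):
        """Train the neural network."""
        self.initialize_parameters()
        self.loss_history = []
        # Use a column-shaped target so y lines up elementwise with the
        # (n_samples, n_outputs) predictions instead of broadcasting to (n_samples, n_samples)
        y = np.asarray(y).reshape(X.shape[0], -1)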
        
        for epoch in range(epochs):
            # Forward pass
            activations, z_values = self.forward_propagation(X)
            
            # Compute loss
            loss = self.loss_function.forward(y, activations[-1].T)
            self.loss_history.append(loss)
            
            # Backward pass
            dW, db = self.backward_propagation(X, y, activations, z_values)
            
            # Flatten gradients for optimizer
            all_params = []
            all_gradients = []
            
            for w, b, dw, db_grad in zip(self.weights, self.biases, dW, db):
                all_params.extend([w, b])
                all_gradients.extend([dw, db_grad])
            
            # Update parameters
            updated_params = self.optimizer.update(all_params, all_gradients)
            
            # Restore parameter structure
            param_idx = 0
            for i in range(len(self.weights)):
                self.weights[i] = updated_params[param_idx]
                self.biases[i] = updated_params[param_idx + 1]
                param_idx += 2
            
            if verbose and epoch % 100 == 0:
                print(f"Epoch {epoch}, Loss: {loss:.6f}")
        
        return self
    
    def predict(self, X):
        """Make predictions."""
        activations, _ = self.forward_propagation(X)
        return activations[-1].T

# Test different loss functions and optimizers
print("Comparing Loss Functions and Optimizers:")
print("=" * 50)

# Generate regression dataset
X_reg, y_reg = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

# Standardize
scaler_X = StandardScaler()
scaler_y = StandardScaler()

X_train_reg_scaled = scaler_X.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_X.transform(X_test_reg)
y_train_reg_scaled = scaler_y.fit_transform(y_train_reg.reshape(-1, 1)).flatten()
y_test_reg_scaled = scaler_y.transform(y_test_reg.reshape(-1, 1)).flatten()

# Test configurations
configurations = {
    'MSE + SGD': {
        'loss': MeanSquaredError(),
        'optimizer': SGD(learning_rate=0.01)
    },
    'MSE + SGD-Momentum': {
        'loss': MeanSquaredError(),
        'optimizer': SGD(learning_rate=0.01, momentum=0.9)
    },
    'MSE + Adam': {
        'loss': MeanSquaredError(),
        'optimizer': Adam(learning_rate=0.001)
    },
    'Huber + Adam': {
        'loss': HuberLoss(delta=1.0),
        'optimizer': Adam(learning_rate=0.001)
    }
}

training_results = {}
trained_models = {}

for config_name, config in configurations.items():
    print(f"\nTraining with {config_name}...")
    
    trainer = NeuralNetworkTrainer(
        layer_sizes=[5, 20, 10, 1],
        loss_function=config['loss'],
        optimizer=config['optimizer'],
        activation='relu'
    )
    
    trainer.train(X_train_reg_scaled, y_train_reg_scaled, epochs=500)
    
    # Evaluate
    y_pred_train = trainer.predict(X_train_reg_scaled)
    y_pred_test = trainer.predict(X_test_reg_scaled)
    
    train_mse = mean_squared_error(y_train_reg_scaled, y_pred_train)
    test_mse = mean_squared_error(y_test_reg_scaled, y_pred_test)
    
    training_results[config_name] = {
        'train_mse': train_mse,
        'test_mse': test_mse,
        'final_loss': trainer.loss_history[-1]
    }
    
    trained_models[config_name] = trainer
    print(f"Final loss: {trainer.loss_history[-1]:.6f}, Test MSE: {test_mse:.6f}")

# Print comparison
results_df = pd.DataFrame(training_results).T
print("\nComparison Results:")
print(results_df.round(6))

In [None]:
# Visualize training dynamics and loss landscapes
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Loss curves
for config_name, trainer in trained_models.items():
    axes[0, 0].plot(trainer.loss_history, label=config_name, alpha=0.8, linewidth=2)

axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].set_title('Training Loss Curves')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_yscale('log')

# Convergence speed comparison
convergence_epochs = []
config_names = list(trained_models.keys())

for config_name, trainer in trained_models.items():
    # Find the first epoch where the loss is within 5% of its final value
    final_loss = trainer.loss_history[-1]
    target_loss = final_loss * 1.05
    
    convergence_epoch = len(trainer.loss_history)
    for i, loss in enumerate(trainer.loss_history):
        if loss <= target_loss:
            convergence_epoch = i
            break
    
    convergence_epochs.append(convergence_epoch)

bars = axes[0, 1].bar(config_names, convergence_epochs, alpha=0.7, color='orange')
axes[0, 1].set_ylabel('Epochs to Convergence')
axes[0, 1].set_title('Convergence Speed')
axes[0, 1].tick_params(axis='x', rotation=45)
for bar, val in zip(bars, convergence_epochs):
    axes[0, 1].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 5,
                   f'{val}', ha='center', va='bottom')
axes[0, 1].grid(True, alpha=0.3)

# Loss function comparison
y_range = np.linspace(-3, 3, 100)
y_true_val = 0  # Target value

mse_loss = MeanSquaredError()
huber_loss = HuberLoss(delta=1.0)

mse_values = [mse_loss.forward(np.array([y_true_val]), np.array([y_pred])) for y_pred in y_range]
huber_values = [huber_loss.forward(np.array([y_true_val]), np.array([y_pred])) for y_pred in y_range]

axes[0, 2].plot(y_range, mse_values, 'b-', label='MSE', linewidth=2)
axes[0, 2].plot(y_range, huber_values, 'r-', label='Huber (δ=1)', linewidth=2)
axes[0, 2].axvline(x=y_true_val, color='black', linestyle='--', alpha=0.5, label='True value')
axes[0, 2].set_xlabel('Predicted Value')
axes[0, 2].set_ylabel('Loss')
axes[0, 2].set_title('Loss Function Comparison')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# Optimizer behavior simulation
# Simple 2D quadratic function: f(x,y) = x² + 10y²
def quadratic_function(x, y):
    return x**2 + 10*y**2

def quadratic_gradient(x, y):
    return np.array([2*x, 20*y])

# Test different optimizers
optimizers_2d = {
    'SGD': SGD(learning_rate=0.1),
    'SGD-Momentum': SGD(learning_rate=0.1, momentum=0.9),
    'Adam': Adam(learning_rate=0.3)
}

# Starting point
start_point = np.array([2.0, 1.0])
n_steps = 50

optimizer_paths = {}

for opt_name, optimizer in optimizers_2d.items():
    path = [start_point.copy()]
    current_point = start_point.copy()
    
    for step in range(n_steps):
        grad = quadratic_gradient(current_point[0], current_point[1])
        updated_params = optimizer.update([current_point], [grad])
        current_point = updated_params[0]
        path.append(current_point.copy())
    
    optimizer_paths[opt_name] = np.array(path)

# Plot optimization paths
x = np.linspace(-2.5, 2.5, 100)
y = np.linspace(-1.5, 1.5, 100)
X_mesh, Y_mesh = np.meshgrid(x, y)
Z = quadratic_function(X_mesh, Y_mesh)

contour = axes[1, 0].contour(X_mesh, Y_mesh, Z, levels=20, alpha=0.6)
axes[1, 0].clabel(contour, inline=True, fontsize=8)

colors = ['blue', 'red', 'green']
for i, (opt_name, path) in enumerate(optimizer_paths.items()):
    axes[1, 0].plot(path[:, 0], path[:, 1], 'o-', color=colors[i], 
                   label=opt_name, alpha=0.8, linewidth=2, markersize=4)

axes[1, 0].plot(0, 0, 'k*', markersize=15, label='Optimum')
axes[1, 0].set_xlabel('x')
axes[1, 0].set_ylabel('y')
axes[1, 0].set_title('Optimizer Paths on Quadratic Function')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Learning rate sensitivity
learning_rates = np.logspace(-3, 0, 20)
final_losses_sgd = []
final_losses_adam = []

for lr in learning_rates:
    # SGD
    trainer_sgd = NeuralNetworkTrainer(
        layer_sizes=[5, 10, 1],
        loss_function=MeanSquaredError(),
        optimizer=SGD(learning_rate=lr),
        activation='relu'
    )
    trainer_sgd.train(X_train_reg_scaled[:100], y_train_reg_scaled[:100], epochs=200)
    final_losses_sgd.append(trainer_sgd.loss_history[-1])
    
    # Adam
    trainer_adam = NeuralNetworkTrainer(
        layer_sizes=[5, 10, 1],
        loss_function=MeanSquaredError(),
        optimizer=Adam(learning_rate=lr),
        activation='relu'
    )
    trainer_adam.train(X_train_reg_scaled[:100], y_train_reg_scaled[:100], epochs=200)
    final_losses_adam.append(trainer_adam.loss_history[-1])

axes[1, 1].loglog(learning_rates, final_losses_sgd, 'o-', label='SGD', linewidth=2, markersize=6)
axes[1, 1].loglog(learning_rates, final_losses_adam, 's-', label='Adam', linewidth=2, markersize=6)
axes[1, 1].set_xlabel('Learning Rate')
axes[1, 1].set_ylabel('Final Loss')
axes[1, 1].set_title('Learning Rate Sensitivity')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# Performance vs network depth
depths = [1, 2, 3, 4, 5]
depth_performance = []

for depth in depths:
    hidden_sizes = [20] * depth
    layer_sizes = [5] + hidden_sizes + [1]
    
    trainer = NeuralNetworkTrainer(
        layer_sizes=layer_sizes,
        loss_function=MeanSquaredError(),
        optimizer=Adam(learning_rate=0.001),
        activation='relu'
    )
    
    trainer.train(X_train_reg_scaled, y_train_reg_scaled, epochs=300)
    
    y_pred = trainer.predict(X_test_reg_scaled)
    test_mse = mean_squared_error(y_test_reg_scaled, y_pred)
    depth_performance.append(test_mse)

axes[1, 2].plot(depths, depth_performance, 'o-', linewidth=2, markersize=8, color='purple')
axes[1, 2].set_xlabel('Network Depth (Hidden Layers)')
axes[1, 2].set_ylabel('Test MSE')
axes[1, 2].set_title('Performance vs Network Depth')
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nOptimization Analysis:")
print("Adam typically reaches a low loss in fewer epochs than plain SGD (see the loss curves above)")
print("Momentum smooths SGD updates and can accelerate convergence")
print("Huber loss is less sensitive to outliers than MSE because it grows linearly for large errors")
print("Adaptive optimizers such as Adam are usually less sensitive to the learning rate than plain SGD")
best_config = min(training_results.keys(), key=lambda x: training_results[x]['test_mse'])
print(f"Best configuration: {best_config} with test MSE: {training_results[best_config]['test_mse']:.6f}")

## Summary and Key Takeaways

### Neural Networks Fundamentals:

1. **Activation Functions**:
   - **Sigmoid**: Output range (0, 1), saturates for large |z| and suffers from vanishing gradients (max derivative = 0.25)
   - **Tanh**: Output range (-1, 1), zero-centered, but still saturates and has the vanishing gradient problem (max derivative = 1)
   - **ReLU**: Mitigates vanishing gradients (derivative is 1 for positive inputs), computationally efficient, but units can "die" (output 0 for every input)
   - **Leaky ReLU**: Addresses dying ReLU problem with small negative slope

2. **Forward Propagation**:
   - Linear transformation: $\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$
   - Activation: $\mathbf{a}^{(l)} = f(\mathbf{z}^{(l)})$
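   - Universal approximation theorem: a single hidden layer with enough units can approximate any continuous function on a compact domain, but the theorem does not say how to find those weights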

3. **Backpropagation Algorithm**:
   - Efficiently computes gradients using chain rule
   - Computational complexity: O(W) per training example for both forward and backward passes, where W is the number of weights
   - Gradient checking with numerical differentiation verifies implementation
   - Deeper networks face vanishing/exploding gradient problems

4. **Loss Functions**:
   - **MSE**: Smooth, differentiable, sensitive to outliers
   - **Cross-entropy**: Preferred for classification, probabilistic interpretation
   - **Huber loss**: Robust to outliers, combines MSE and MAE benefits
   - Choice affects convergence speed and robustness

5. **Optimization Methods**:
   - **SGD**: Simple, requires careful learning rate tuning
   - **SGD + Momentum**: Accelerates convergence and dampens oscillations; can carry updates past shallow local minima and plateaus
   - **Adam**: Adaptive learning rates, generally robust and fast converging
   - **Learning rate**: Critical hyperparameter affecting convergence
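
The sigmoid bound quoted in item 1 can be checked numerically. The sketch below is illustrative only: `sigmoid` and `sigmoid_derivative` are helper names introduced here (not the notebook's `Sigmoid` class), and the ten-layer product ignores the weight matrices and assumes every pre-activation sits at the best case $z = 0$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the logistic function: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Scan a range of pre-activations; the derivative peaks at z = 0
z = np.linspace(-10, 10, 2001)
max_derivative = sigmoid_derivative(z).max()

# Best-case product of activation derivatives across a stack of sigmoid layers
n_layers = 10
best_case_factor = max_derivative ** n_layers

print(f"Max sigmoid derivative: {max_derivative:.4f}")
print(f"Best-case derivative product over {n_layers} layers: {best_case_factor:.2e}")
```

Even in this best case the activation derivatives alone shrink the gradient by roughly six orders of magnitude over ten layers, which is why saturating activations make deep networks hard to train.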

### Practical Guidelines:

**Architecture Design:**
- Start with ReLU activation for hidden layers
- Use appropriate output activation (linear for regression, sigmoid for binary, softmax for multiclass)
- Begin with 2-3 hidden layers, increase if needed
- Layer width: start with 2-10x input size
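
As a rough illustration of the output-activation guideline, the sketch below shows a sigmoid for binary outputs and a numerically stable softmax for multiclass outputs. The function names are hypothetical, and the `(n_classes, n_samples)` layout is an assumption chosen to match the column-per-sample convention used by the trainer in this notebook.

```python
import numpy as np

def sigmoid_output(z):
    """Binary classification: map each logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax_output(z):
    """Multiclass: z has shape (n_classes, n_samples); each column sums to 1."""
    # Subtracting the column maximum leaves the result unchanged
    # but prevents overflow in np.exp for large logits
    shifted = z - z.max(axis=0, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=0, keepdims=True)

# Two samples (columns), three classes (rows); the second column has huge logits
logits = np.array([[2.0, 1000.0],
                   [1.0, 1000.0],
                   [0.1, 1000.0]])
probs = softmax_output(logits)
binary_prob = sigmoid_output(np.array([0.0]))

print(probs)
print(binary_prob)
```

Without the max subtraction, `np.exp(1000.0)` would overflow and the second column would come out as `nan`.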

**Training Best Practices:**
- Initialize weights properly (Xavier/He initialization)
- Standardize input features
- Use Adam optimizer as default choice
- Monitor both training and validation loss
- Implement gradient checking for custom implementations
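
A minimal sketch of gradient checking, assuming a loss that can be evaluated as a function of a single parameter array. The helpers `numerical_gradient` and `relative_error` are names introduced for this example, and the linear-regression loss stands in for a full network to keep the check short.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-5):
    """Central-difference estimate of df/dtheta, one entry at a time."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        original = theta[idx]
        theta[idx] = original + eps
        loss_plus = f(theta)
        theta[idx] = original - eps
        loss_minus = f(theta)
        theta[idx] = original  # restore the parameter before moving on
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

def relative_error(a, b):
    """Scale-free difference between two gradient arrays."""
    return np.linalg.norm(a - b) / max(np.linalg.norm(a) + np.linalg.norm(b), 1e-12)

# Toy problem: mean squared error of a linear model
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

def mse_loss(w):
    return np.mean((X @ w - y) ** 2)

analytic = 2.0 * X.T @ (X @ w - y) / len(y)   # hand-derived gradient
numeric = numerical_gradient(mse_loss, w)
error = relative_error(analytic, numeric)

print(f"Relative error: {error:.2e}")
```

A relative error several orders of magnitude below 1 suggests the analytic gradient is correct; a value near 1 usually indicates a bug in the backward pass.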

**Common Issues:**
- **Vanishing gradients**: Use ReLU, proper initialization, batch normalization
- **Exploding gradients**: Gradient clipping, lower learning rate
- **Overfitting**: Regularization, dropout, early stopping
- **Slow convergence**: Learning rate scheduling, momentum, adaptive optimizers
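
Gradient clipping, listed above as a remedy for exploding gradients, can be sketched as follows. `clip_by_global_norm` is a hypothetical helper, not part of the notebook; it rescales a list of gradient arrays so that their combined L2 norm does not exceed `max_norm`.

```python
import numpy as np

def clip_by_global_norm(gradients, max_norm):
    """Rescale all gradients together if their joint L2 norm exceeds max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    if global_norm <= max_norm:
        return [g.copy() for g in gradients], global_norm
    scale = max_norm / global_norm
    return [g * scale for g in gradients], global_norm

# Two gradient arrays whose joint norm is sqrt(3^2 + 4^2) = 5
grads = [np.array([[3.0, 0.0]]), np.array([[4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print(f"Norm before: {norm_before:.2f}, after: {norm_after:.2f}")
```

In the trainer above, one plausible place to apply this would be on `all_gradients` just before the call to `self.optimizer.update`. Scaling every array by the same factor preserves the update direction.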

### Mathematical Insights:
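- Backpropagation is reverse-mode automatic differentiation applied to neural networks
- Gradient descent converges to stationary points (local minima or saddle points) in non-convex loss landscapes, with no guarantee of reaching the global minimum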
- Activation function choice critically affects gradient flow
- Optimization landscape becomes more complex with network depth
- Proper initialization and normalization are crucial for trainability
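
The dependence of gradient descent on its starting point in a non-convex landscape can be seen on a one-dimensional toy function. The function $f(x) = x^4 - 3x^2 + x$, the step size, and the helper names below are assumptions chosen for illustration, not part of the notebook.

```python
def f(x):
    """Non-convex toy loss with two local minima."""
    return x ** 4 - 3 * x ** 2 + x

def f_grad(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x0, learning_rate=0.01, n_steps=2000):
    """Plain gradient descent from the starting point x0."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * f_grad(x)
    return x

x_left = gradient_descent(-2.0)   # ends in the deeper minimum
x_right = gradient_descent(2.0)   # ends in the shallower minimum

print(f"Start -2.0 -> x = {x_left:.3f}, f(x) = {f(x_left):.3f}")
print(f"Start  2.0 -> x = {x_right:.3f}, f(x) = {f(x_right):.3f}")
```

Both runs stop at a point where the gradient vanishes, but only the run started at $-2$ reaches the lower of the two minima.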