# Module 1 - Exercise 2: Mathematical Implementation

## Learning Objectives
- Implement gradient descent from scratch using NumPy and PyTorch tensors
- Understand optimization trajectories and convergence behavior
- Compare different optimization algorithms (SGD, momentum, AdaGrad, RMSprop) and learning rate schedules
- Visualize mathematical optimization concepts
- Implement key mathematical functions for machine learning

## Prerequisites
- Completion of Exercise 1: Environment & Basics
- Basic calculus knowledge (derivatives, gradients)
- Understanding of optimization concepts
- Familiarity with NumPy arrays and mathematical operations

## Setup and Test Repository

First, let's clone the test repository and set up our environment for step-by-step validation.

In [None]:
# Clone the test repository
!git clone https://github.com/racousin/data_science_practice.git /tmp/tests 2>/dev/null || true
!cd /tmp/tests && pwd && ls -la tests/python_deep_learning/module1/

# Import the test module
import sys
sys.path.append('/tmp/tests')
print("Test repository setup complete!")

## Environment Setup

Import necessary libraries for mathematical implementations and visualizations.

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import math
from typing import Callable, Tuple, List

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Import test functions
from tests.python_deep_learning.module1.test_exercise2 import *

print(f"PyTorch version: {torch.__version__}")
print(f"NumPy version: {np.__version__}")

## Section 1: Gradient Descent from Scratch with NumPy

Implement basic gradient descent using only NumPy: starting from an initial point, repeatedly step against the gradient, `x = x - learning_rate * gradient`, for a fixed number of iterations.
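
As a warm-up before the 2D exercise, the update rule can be sketched in one dimension. The function `f(x) = (x - 5)^2`, the helper name `gradient_descent_1d`, and the chosen step size are illustrative assumptions; they are not part of the exercise or its tests.

```python
def gradient_descent_1d(grad, x0, learning_rate, num_iterations):
    """Hypothetical 1D helper: repeatedly step against the derivative."""
    x = x0
    history = [x]
    for _ in range(num_iterations):
        x = x - learning_rate * grad(x)  # x_{k+1} = x_k - lr * f'(x_k)
        history.append(x)
    return x, history

# f(x) = (x - 5)^2 has derivative f'(x) = 2 * (x - 5) and its minimum at x = 5
x_final, history = gradient_descent_1d(lambda x: 2 * (x - 5), x0=0.0,
                                       learning_rate=0.1, num_iterations=50)
print(f"x after 50 steps: {x_final:.6f}")
```

With this step size each update shrinks the distance to the minimum by a factor of `1 - 2 * 0.1 = 0.8`, so the error decays geometrically. For this particular function, a step size above 1.0 would push that factor beyond 1 in magnitude and the iterates would diverge.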

In [None]:
def quadratic_function(x: np.ndarray) -> float:
    """Simple quadratic function: f(x) = (x[0]-3)^2 + (x[1]-2)^2"""
    return (x[0] - 3)**2 + (x[1] - 2)**2

def quadratic_gradient(x: np.ndarray) -> np.ndarray:
    """Gradient of the quadratic function"""
    return np.array([2*(x[0] - 3), 2*(x[1] - 2)])

def rosenbrock_function(x: np.ndarray) -> float:
    """Rosenbrock function: f(x) = (1-x[0])^2 + 100*(x[1]-x[0]^2)^2"""
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_gradient(x: np.ndarray) -> np.ndarray:
    """Gradient of the Rosenbrock function"""
    dx0 = -2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2)
    dx1 = 200*(x[1] - x[0]**2)
    return np.array([dx0, dx1])

# TODO: Implement gradient descent with NumPy
def gradient_descent_numpy(func: Callable, grad_func: Callable, 
                          initial_point: np.ndarray, learning_rate: float, 
                          num_iterations: int) -> Tuple[np.ndarray, List[np.ndarray], List[float]]:
    """
    Implement gradient descent using NumPy.
    
    Args:
        func: The function to minimize
        grad_func: The gradient function
        initial_point: Starting point for optimization
        learning_rate: Step size for updates
        num_iterations: Number of optimization steps
    
    Returns:
        final_point: Final optimized point
        trajectory: List of points visited during optimization
        losses: List of function values at each step
    """
    # TODO: Initialize variables
    current_point = None
    trajectory = []
    losses = []
    
    # TODO: Implement the gradient descent loop
    for i in range(num_iterations):
        # Calculate current loss and gradient
        current_loss = None
        current_gradient = None
        
        # Store trajectory and loss
        trajectory.append(current_point.copy())
        losses.append(current_loss)
        
        # Update point using gradient descent rule
        current_point = None  # current_point - learning_rate * gradient
    
    return current_point, trajectory, losses

# Test your implementation
initial_point = np.array([0.0, 0.0])
final_point, trajectory, losses = gradient_descent_numpy(
    quadratic_function, quadratic_gradient, initial_point, 0.1, 50
)

print(f"Initial point: {initial_point}")
print(f"Final point: {final_point}")
print(f"Final loss: {losses[-1] if losses else 'No losses recorded'}")
print("Expected minimum: [3.0, 2.0] with loss 0.0")

In [None]:
# Test your NumPy gradient descent implementation
try:
    test_numpy_gradient_descent(locals())
    print("✅ Section 1: NumPy Gradient Descent - All tests passed!")
except Exception as e:
    print(f"❌ Section 1: NumPy Gradient Descent - Tests failed: {e}")
    print("Please complete the gradient descent implementation above before proceeding.")

## Section 2: Gradient Descent with PyTorch Tensors (No Autograd)

Implement gradient descent with PyTorch tensors but without autograd, supplying hand-derived gradients as in Section 1, to understand manual differentiation.
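
Because this section relies on hand-derived gradients, a quick check against finite differences can catch sign or factor mistakes before running the optimizer. The sketch below uses plain Python floats rather than tensors; the helper `numerical_grad`, the step `h`, and the test point are assumptions made for illustration.

```python
def rosenbrock(x0, x1):
    return (1 - x0) ** 2 + 100 * (x1 - x0 ** 2) ** 2

def rosenbrock_grad(x0, x1):
    """Hand-derived partial derivatives, same formulas as `rosenbrock_gradient`."""
    dx0 = -2 * (1 - x0) - 400 * x0 * (x1 - x0 ** 2)
    dx1 = 200 * (x1 - x0 ** 2)
    return dx0, dx1

def numerical_grad(f, x0, x1, h=1e-5):
    """Central finite differences: (f(x + h) - f(x - h)) / (2h) per coordinate."""
    d0 = (f(x0 + h, x1) - f(x0 - h, x1)) / (2 * h)
    d1 = (f(x0, x1 + h) - f(x0, x1 - h)) / (2 * h)
    return d0, d1

point = (-1.2, 1.0)
analytic = rosenbrock_grad(*point)
numeric = numerical_grad(rosenbrock, *point)
max_abs_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print(f"analytic={analytic}, numeric={numeric}, max error={max_abs_error:.2e}")
```

A small maximum error suggests the analytic formulas are consistent with the function; a large one usually points to a dropped factor or a flipped sign.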

In [None]:
def quadratic_function_torch(x: torch.Tensor) -> torch.Tensor:
    """Quadratic function using PyTorch tensors"""
    return (x[0] - 3)**2 + (x[1] - 2)**2

def quadratic_gradient_torch(x: torch.Tensor) -> torch.Tensor:
    """Gradient of quadratic function using PyTorch tensors"""
    return torch.stack([2*(x[0] - 3), 2*(x[1] - 2)])

def rosenbrock_function_torch(x: torch.Tensor) -> torch.Tensor:
    """Rosenbrock function using PyTorch tensors"""
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_gradient_torch(x: torch.Tensor) -> torch.Tensor:
    """Gradient of Rosenbrock function using PyTorch tensors"""
    dx0 = -2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2)
    dx1 = 200*(x[1] - x[0]**2)
    return torch.stack([dx0, dx1])

# TODO: Implement gradient descent with PyTorch tensors
def gradient_descent_torch(func: Callable, grad_func: Callable, 
                          initial_point: torch.Tensor, learning_rate: float, 
                          num_iterations: int) -> Tuple[torch.Tensor, List[torch.Tensor], List[float]]:
    """
    Implement gradient descent using PyTorch tensors (without autograd).
    
    Args:
        func: The function to minimize
        grad_func: The gradient function  
        initial_point: Starting point for optimization
        learning_rate: Step size for updates
        num_iterations: Number of optimization steps
        
    Returns:
        final_point: Final optimized point
        trajectory: List of points visited during optimization
        losses: List of function values at each step
    """
    # TODO: Initialize variables
    current_point = None
    trajectory = []
    losses = []
    
    # TODO: Implement the gradient descent loop
    for i in range(num_iterations):
        # Calculate current loss and gradient
        current_loss = None
        current_gradient = None
        
        # Store trajectory and loss
        trajectory.append(current_point.clone())
        losses.append(current_loss.item())
        
        # Update point using gradient descent rule
        current_point = None  # current_point - learning_rate * gradient
    
    return current_point, trajectory, losses

# Test your PyTorch implementation
initial_point_torch = torch.tensor([0.0, 0.0], dtype=torch.float32)
final_point_torch, trajectory_torch, losses_torch = gradient_descent_torch(
    quadratic_function_torch, quadratic_gradient_torch, initial_point_torch, 0.1, 50
)

print(f"Initial point: {initial_point_torch}")
print(f"Final point: {final_point_torch}")
print(f"Final loss: {losses_torch[-1] if losses_torch else 'No losses recorded'}")
print("Expected minimum: [3.0, 2.0] with loss 0.0")

In [None]:
# Test your PyTorch gradient descent implementation
try:
    test_torch_gradient_descent(locals())
    print("✅ Section 2: PyTorch Gradient Descent - All tests passed!")
except Exception as e:
    print(f"❌ Section 2: PyTorch Gradient Descent - Tests failed: {e}")
    print("Please complete the PyTorch gradient descent implementation above before proceeding.")

## Section 3: Optimization Trajectory Visualization

Visualize how gradient descent navigates the optimization landscape by drawing its trajectory on top of a contour plot of the function.
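
A contour plot needs the function's value at every point of a 2D grid. One possible way to get it, sketched here on a made-up function `bowl` (an assumption, not one of the exercise functions), is to stack the two meshgrid arrays so that indexing with `[0]` and `[1]` picks out the x and y coordinates.

```python
import numpy as np

def bowl(p):
    """Hypothetical example function with its minimum at (1, -1)."""
    return (p[0] - 1) ** 2 + 2 * (p[1] + 1) ** 2

x = np.linspace(-2, 2, 5)   # 5 x-values: -2, -1, 0, 1, 2
y = np.linspace(-3, 1, 5)   # 5 y-values: -3, -2, -1, 0, 1
X, Y = np.meshgrid(x, y)    # both have shape (len(y), len(x))

# Indexing p[0] and p[1] on the stacked array selects the X and Y grids,
# so the function is evaluated at every grid point at once.
Z = bowl(np.stack([X, Y]))

row, col = np.unravel_index(Z.argmin(), Z.shape)
print(f"grid minimum {Z.min()} at x={X[row, col]}, y={Y[row, col]}")
```

This vectorized call works only for functions built from elementwise operations; a function with Python branching on its input would need to be evaluated point by point.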

In [None]:
# TODO: Create visualization function for 2D optimization
def plot_optimization_trajectory(func, trajectory, title="Optimization Trajectory"):
    """
    Plot the optimization trajectory on a 2D contour plot.
    
    Args:
        func: The function being optimized
        trajectory: List of points visited during optimization
        title: Plot title
    """
    # TODO: Create a grid for contour plot
    if len(trajectory) == 0:
        print("No trajectory to plot")
        return
    
    # Convert trajectory to numpy if it's torch tensors
    if isinstance(trajectory[0], torch.Tensor):
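        # detach() and cpu() keep this working for tensors that track gradients or live on a GPU
        trajectory_np = [point.detach().cpu().numpy() for point in trajectory]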
    else:
        trajectory_np = trajectory
    
    # Create grid for contour plot
    trajectory_array = np.array(trajectory_np)
    x_min, x_max = trajectory_array[:, 0].min() - 1, trajectory_array[:, 0].max() + 1
    y_min, y_max = trajectory_array[:, 1].min() - 1, trajectory_array[:, 1].max() + 1
    
    x = np.linspace(x_min, x_max, 100)
    y = np.linspace(y_min, y_max, 100)
    X, Y = np.meshgrid(x, y)
    
    # TODO: Evaluate function on grid
    Z = None  # Apply func to each point in the grid
    
    # TODO: Create the plot
    plt.figure(figsize=(10, 8))
    
    # Plot contour
    contour = plt.contour(X, Y, Z, levels=20, alpha=0.6)
    plt.contourf(X, Y, Z, levels=20, alpha=0.3)
    plt.colorbar(label='Function Value')
    
    # Plot trajectory
    trajectory_x = [point[0] for point in trajectory_np]
    trajectory_y = [point[1] for point in trajectory_np]
    
    plt.plot(trajectory_x, trajectory_y, 'ro-', linewidth=2, markersize=6, label='Optimization Path')
    plt.plot(trajectory_x[0], trajectory_y[0], 'gs', markersize=10, label='Start')
    plt.plot(trajectory_x[-1], trajectory_y[-1], 'r^', markersize=10, label='End')
    
    plt.xlabel('x[0]')
    plt.ylabel('x[1]')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Visualize trajectories for different functions
if 'trajectory' in locals() and trajectory:
    plot_optimization_trajectory(quadratic_function, trajectory, "Quadratic Function Optimization")

# Test on Rosenbrock function
rosenbrock_result = gradient_descent_numpy(
    rosenbrock_function, rosenbrock_gradient, 
    np.array([-1.0, 1.0]), 0.001, 1000
)
rosenbrock_final, rosenbrock_trajectory, rosenbrock_losses = rosenbrock_result

if rosenbrock_trajectory:
    plot_optimization_trajectory(rosenbrock_function, rosenbrock_trajectory, "Rosenbrock Function Optimization")
    print(f"Rosenbrock final point: {rosenbrock_final}")
    print(f"Rosenbrock final loss: {rosenbrock_losses[-1]}")
    print("Expected minimum: [1.0, 1.0] with loss 0.0")

In [None]:
# Test your visualization implementation
try:
    test_visualization(locals())
    print("✅ Section 3: Optimization Visualization - All tests passed!")
except Exception as e:
    print(f"❌ Section 3: Optimization Visualization - Tests failed: {e}")
    print("Please complete the visualization implementation above before proceeding.")

## Section 4: SGD with Momentum

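Implement SGD with momentum, which accumulates a decaying sum of past gradients (the velocity) to damp oscillations and speed up progress along shallow directions. The gradients in this exercise are exact rather than sampled from mini-batches, so "SGD" here names the update rule, as in `torch.optim.SGD`.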
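
To see why the velocity term helps, the sketch below compares plain gradient descent with the momentum update on a small ill-conditioned quadratic. The function, the starting point, and the helper `run` are illustrative assumptions; the velocity convention follows the comments in the exercise cell (`momentum * velocity + learning_rate * gradient`).

```python
def loss(x, y):
    # Ill-conditioned quadratic: curvature 1 along x, 50 along y
    return 0.5 * (x ** 2 + 50 * y ** 2)

def grad(x, y):
    return x, 50 * y

def run(learning_rate, momentum, num_iterations, start=(1.0, 1.0)):
    """Heavy-ball update: v = momentum * v + lr * g, then x = x - v.
    With momentum=0 this reduces to plain gradient descent."""
    x, y = start
    vx, vy = 0.0, 0.0
    for _ in range(num_iterations):
        gx, gy = grad(x, y)
        vx = momentum * vx + learning_rate * gx
        vy = momentum * vy + learning_rate * gy
        x, y = x - vx, y - vy
    return loss(x, y)

loss_plain = run(learning_rate=0.01, momentum=0.0, num_iterations=100)
loss_momentum = run(learning_rate=0.01, momentum=0.9, num_iterations=100)
print(f"plain: {loss_plain:.3e}, momentum: {loss_momentum:.3e}")
```

On this problem plain gradient descent is held back by the shallow x direction, where each step removes only 1% of the remaining error. The accumulated velocity lets the momentum variant take effectively larger steps there without raising the learning rate.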

In [None]:
# TODO: Implement SGD with momentum
def sgd_with_momentum(func: Callable, grad_func: Callable, 
                     initial_point: np.ndarray, learning_rate: float, 
                     momentum: float, num_iterations: int) -> Tuple[np.ndarray, List[np.ndarray], List[float]]:
    """
    Implement SGD with momentum.
    
    Args:
        func: The function to minimize
        grad_func: The gradient function
        initial_point: Starting point for optimization
        learning_rate: Step size for updates
        momentum: Momentum coefficient (typically 0.9)
        num_iterations: Number of optimization steps
        
    Returns:
        final_point: Final optimized point
        trajectory: List of points visited during optimization
        losses: List of function values at each step
    """
    # TODO: Initialize variables
    current_point = None
    velocity = None  # Initialize momentum velocity
    trajectory = []
    losses = []
    
    # TODO: Implement the SGD with momentum loop
    for i in range(num_iterations):
        # Calculate current loss and gradient
        current_loss = None
        current_gradient = None
        
        # Store trajectory and loss
        trajectory.append(current_point.copy())
        losses.append(current_loss)
        
        # Update velocity with momentum
        velocity = None  # momentum * velocity + learning_rate * gradient
        
        # Update point using velocity
        current_point = None  # current_point - velocity
    
    return current_point, trajectory, losses

# Compare regular SGD vs SGD with momentum
initial_point = np.array([-1.0, 1.0])

# Regular SGD
sgd_result = gradient_descent_numpy(
    rosenbrock_function, rosenbrock_gradient, 
    initial_point.copy(), 0.001, 500
)
sgd_final, sgd_trajectory, sgd_losses = sgd_result

# SGD with momentum
momentum_result = sgd_with_momentum(
    rosenbrock_function, rosenbrock_gradient,
    initial_point.copy(), 0.001, 0.9, 500
)
momentum_final, momentum_trajectory, momentum_losses = momentum_result

print(f"Regular SGD final point: {sgd_final}")
print(f"Regular SGD final loss: {sgd_losses[-1] if sgd_losses else 'No losses'}")
print(f"SGD with momentum final point: {momentum_final}")
print(f"SGD with momentum final loss: {momentum_losses[-1] if momentum_losses else 'No losses'}")

# Plot loss curves
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
if sgd_losses and momentum_losses:
    plt.plot(sgd_losses, label='Regular SGD', alpha=0.7)
    plt.plot(momentum_losses, label='SGD with Momentum', alpha=0.7)
    plt.xlabel('Iteration')
    plt.ylabel('Loss')
    plt.title('Loss Comparison')
    plt.legend()
    plt.yscale('log')
    plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
if sgd_losses and momentum_losses:
    plt.plot(sgd_losses[-100:], label='Regular SGD (last 100)', alpha=0.7)
    plt.plot(momentum_losses[-100:], label='SGD with Momentum (last 100)', alpha=0.7)
    plt.xlabel('Iteration')
    plt.ylabel('Loss')
    plt.title('Loss Comparison (Last 100 iterations)')
    plt.legend()
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Test your momentum implementation
try:
    test_sgd_momentum(locals())
    print("✅ Section 4: SGD with Momentum - All tests passed!")
except Exception as e:
    print(f"❌ Section 4: SGD with Momentum - Tests failed: {e}")
    print("Please complete the SGD with momentum implementation above before proceeding.")

## Section 5: Adaptive Learning Rate Methods

Implement two adaptive learning rate methods, AdaGrad and RMSprop, which scale each coordinate's step by a running statistic of its squared gradients.
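
The difference between the two accumulators is easiest to see with a constant gradient. The following sketch tracks only the effective step size, not an actual optimization; the constant gradient of 1.0 and the hyperparameter values are assumptions chosen to isolate the scaling behavior.

```python
import math

learning_rate, beta, epsilon = 0.1, 0.9, 1e-8
gradient = 1.0  # hypothetical constant gradient, chosen to isolate the scaling

sum_sq = 0.0      # AdaGrad: running sum of squared gradients
avg_sq = 0.0      # RMSprop: exponential moving average of squared gradients
adagrad_steps, rmsprop_steps = [], []

for _ in range(100):
    sum_sq += gradient ** 2
    avg_sq = beta * avg_sq + (1 - beta) * gradient ** 2
    adagrad_steps.append(learning_rate / math.sqrt(sum_sq + epsilon))
    rmsprop_steps.append(learning_rate / math.sqrt(avg_sq + epsilon))

print(f"AdaGrad step: first={adagrad_steps[0]:.4f}, last={adagrad_steps[-1]:.4f}")
print(f"RMSprop step: first={rmsprop_steps[0]:.4f}, last={rmsprop_steps[-1]:.4f}")
```

AdaGrad's step shrinks like `1 / sqrt(t)` and never recovers, while RMSprop's settles near the base learning rate once the moving average has warmed up. Some implementations add `epsilon` outside the square root instead; the placement here follows the comments in the exercise cell.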

In [None]:
# TODO: Implement AdaGrad optimizer
def adagrad_optimizer(func: Callable, grad_func: Callable, 
                     initial_point: np.ndarray, learning_rate: float, 
                     epsilon: float, num_iterations: int) -> Tuple[np.ndarray, List[np.ndarray], List[float]]:
    """
    Implement AdaGrad optimizer.
    
    Args:
        func: The function to minimize
        grad_func: The gradient function
        initial_point: Starting point for optimization
        learning_rate: Base learning rate
        epsilon: Small constant for numerical stability
        num_iterations: Number of optimization steps
        
    Returns:
        final_point: Final optimized point
        trajectory: List of points visited during optimization
        losses: List of function values at each step
    """
    # TODO: Initialize variables
    current_point = None
    sum_squared_gradients = None  # Accumulated squared gradients
    trajectory = []
    losses = []
    
    # TODO: Implement the AdaGrad loop
    for i in range(num_iterations):
        # Calculate current loss and gradient
        current_loss = None
        current_gradient = None
        
        # Store trajectory and loss
        trajectory.append(current_point.copy())
        losses.append(current_loss)
        
        # Update sum of squared gradients
        sum_squared_gradients = None  # sum_squared_gradients + gradient^2
        
        # Compute adaptive learning rate
        adaptive_lr = None  # learning_rate / sqrt(sum_squared_gradients + epsilon)
        
        # Update point
        current_point = None  # current_point - adaptive_lr * gradient
    
    return current_point, trajectory, losses

# TODO: Implement RMSprop optimizer
def rmsprop_optimizer(func: Callable, grad_func: Callable, 
                     initial_point: np.ndarray, learning_rate: float, 
                     beta: float, epsilon: float, num_iterations: int) -> Tuple[np.ndarray, List[np.ndarray], List[float]]:
    """
    Implement RMSprop optimizer.
    
    Args:
        func: The function to minimize
        grad_func: The gradient function
        initial_point: Starting point for optimization
        learning_rate: Base learning rate
        beta: Exponential decay rate for moving average
        epsilon: Small constant for numerical stability
        num_iterations: Number of optimization steps
        
    Returns:
        final_point: Final optimized point
        trajectory: List of points visited during optimization
        losses: List of function values at each step
    """
    # TODO: Initialize variables
    current_point = None
    moving_avg_squared_grad = None  # Moving average of squared gradients
    trajectory = []
    losses = []
    
    # TODO: Implement the RMSprop loop
    for i in range(num_iterations):
        # Calculate current loss and gradient
        current_loss = None
        current_gradient = None
        
        # Store trajectory and loss
        trajectory.append(current_point.copy())
        losses.append(current_loss)
        
        # Update moving average of squared gradients
        moving_avg_squared_grad = None  # beta * moving_avg + (1-beta) * gradient^2
        
        # Compute adaptive learning rate
        adaptive_lr = None  # learning_rate / sqrt(moving_avg_squared_grad + epsilon)
        
        # Update point
        current_point = None  # current_point - adaptive_lr * gradient
    
    return current_point, trajectory, losses

# Compare different optimizers
initial_point = np.array([-1.0, 1.0])
num_iters = 500

# Test AdaGrad
adagrad_result = adagrad_optimizer(
    rosenbrock_function, rosenbrock_gradient,
    initial_point.copy(), 0.1, 1e-8, num_iters
)
adagrad_final, adagrad_trajectory, adagrad_losses = adagrad_result

# Test RMSprop
rmsprop_result = rmsprop_optimizer(
    rosenbrock_function, rosenbrock_gradient,
    initial_point.copy(), 0.01, 0.9, 1e-8, num_iters
)
rmsprop_final, rmsprop_trajectory, rmsprop_losses = rmsprop_result

print(f"AdaGrad final point: {adagrad_final}")
print(f"AdaGrad final loss: {adagrad_losses[-1] if adagrad_losses else 'No losses'}")
print(f"RMSprop final point: {rmsprop_final}")
print(f"RMSprop final loss: {rmsprop_losses[-1] if rmsprop_losses else 'No losses'}")

# Plot comparison of all methods
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
optimizers = [
    ('SGD', sgd_losses),
    ('SGD+Momentum', momentum_losses),
    ('AdaGrad', adagrad_losses),
    ('RMSprop', rmsprop_losses)
]

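# Distinct loop names keep Section 1's `losses` from being overwritten
for name, optimizer_losses in optimizers:
    if optimizer_losses:
        plt.plot(optimizer_losses, label=name, alpha=0.8)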

plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Optimizer Comparison - Full Training')
plt.legend()
plt.yscale('log')
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 2)
for name, optimizer_losses in optimizers:
    if optimizer_losses and len(optimizer_losses) > 100:
        plt.plot(optimizer_losses[-100:], label=name, alpha=0.8)

plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Optimizer Comparison - Last 100 iterations')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 3)
trajectories = [
    ('SGD', sgd_trajectory),
    ('SGD+Momentum', momentum_trajectory),
    ('AdaGrad', adagrad_trajectory),
    ('RMSprop', rmsprop_trajectory)
]

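# Distinct loop names keep Section 1's `trajectory` from being overwritten
for name, optimizer_trajectory in trajectories:
    if optimizer_trajectory:
        traj_array = np.array(optimizer_trajectory)
        plt.plot(traj_array[:, 0], traj_array[:, 1], label=name, alpha=0.8)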

plt.plot(1.0, 1.0, 'r*', markersize=15, label='Global Minimum')
plt.xlabel('x[0]')
plt.ylabel('x[1]')
plt.title('Optimization Paths')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Test your adaptive learning rate implementations
try:
    test_adaptive_optimizers(locals())
    print("✅ Section 5: Adaptive Learning Rate Methods - All tests passed!")
except Exception as e:
    print(f"❌ Section 5: Adaptive Learning Rate Methods - Tests failed: {e}")
    print("Please complete the adaptive optimizer implementations above before proceeding.")

## Section 6: Learning Rate Schedules

Implement three learning rate schedules (step decay, exponential decay, and cosine annealing) and use them to vary the step size during gradient descent.
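
For reference, cosine annealing is commonly written as below. This form assumes a minimum learning rate of zero, which is a guess based on `cosine_schedule` taking no minimum-rate argument, so check it against the tests:

$$
\eta_t = \frac{\eta_0}{2}\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)
$$

where $\eta_0$ is `initial_lr`, $t$ is `epoch`, and $T$ is `max_epochs`. As a worked example of the step decay formula from its docstring, `initial_lr = 0.01`, `step_size = 100`, and `gamma = 0.5` give at epoch 250:

$$
0.01 \times 0.5^{\lfloor 250 / 100 \rfloor} = 0.01 \times 0.5^{2} = 0.0025
$$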

In [None]:
# TODO: Implement learning rate schedules
def step_schedule(initial_lr: float, step_size: int, gamma: float, epoch: int) -> float:
    """Step decay schedule: lr = initial_lr * gamma^(epoch // step_size)"""
    # TODO: Implement step decay
    return None

def exponential_schedule(initial_lr: float, gamma: float, epoch: int) -> float:
    """Exponential decay schedule: lr = initial_lr * gamma^epoch"""
    # TODO: Implement exponential decay
    return None

def cosine_schedule(initial_lr: float, max_epochs: int, epoch: int) -> float:
    """Cosine annealing schedule"""
    # TODO: Implement cosine annealing
    return None

# TODO: Implement SGD with learning rate schedule
def sgd_with_schedule(func: Callable, grad_func: Callable, 
                     initial_point: np.ndarray, initial_lr: float,
                     schedule_func: Callable, num_iterations: int) -> Tuple[np.ndarray, List[np.ndarray], List[float], List[float]]:
    """
    Implement SGD with learning rate scheduling.
    
    Args:
        func: The function to minimize
        grad_func: The gradient function
        initial_point: Starting point for optimization
        initial_lr: Initial learning rate
        schedule_func: Function that returns learning rate given epoch
        num_iterations: Number of optimization steps
        
    Returns:
        final_point: Final optimized point
        trajectory: List of points visited during optimization
        losses: List of function values at each step
        learning_rates: List of learning rates used at each step
    """
    # TODO: Initialize variables
    current_point = None
    trajectory = []
    losses = []
    learning_rates = []
    
    # TODO: Implement the SGD with schedule loop
    for i in range(num_iterations):
        # Get current learning rate from schedule
        current_lr = None  # schedule_func(i)
        
        # Calculate current loss and gradient
        current_loss = None
        current_gradient = None
        
        # Store trajectory, loss, and learning rate
        trajectory.append(current_point.copy())
        losses.append(current_loss)
        learning_rates.append(current_lr)
        
        # Update point
        current_point = None  # current_point - current_lr * gradient
    
    return current_point, trajectory, losses, learning_rates

# Test different schedules
initial_point = np.array([-1.0, 1.0])
num_iters = 500

# Step schedule
step_sched = lambda epoch: step_schedule(0.01, 100, 0.5, epoch)
step_result = sgd_with_schedule(
    rosenbrock_function, rosenbrock_gradient,
    initial_point.copy(), 0.01, step_sched, num_iters
)
step_final, step_trajectory, step_losses, step_lrs = step_result

# Exponential schedule
exp_sched = lambda epoch: exponential_schedule(0.01, 0.999, epoch)
exp_result = sgd_with_schedule(
    rosenbrock_function, rosenbrock_gradient,
    initial_point.copy(), 0.01, exp_sched, num_iters
)
exp_final, exp_trajectory, exp_losses, exp_lrs = exp_result

# Cosine schedule
cos_sched = lambda epoch: cosine_schedule(0.01, num_iters, epoch)
cos_result = sgd_with_schedule(
    rosenbrock_function, rosenbrock_gradient,
    initial_point.copy(), 0.01, cos_sched, num_iters
)
cos_final, cos_trajectory, cos_losses, cos_lrs = cos_result

print(f"Step schedule final point: {step_final}, loss: {step_losses[-1] if step_losses else 'N/A'}")
print(f"Exponential schedule final point: {exp_final}, loss: {exp_losses[-1] if exp_losses else 'N/A'}")
print(f"Cosine schedule final point: {cos_final}, loss: {cos_losses[-1] if cos_losses else 'N/A'}")

# Plot learning rate schedules and losses
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
schedules = [
    ('Step', step_lrs),
    ('Exponential', exp_lrs),
    ('Cosine', cos_lrs)
]

for name, lrs in schedules:
    if lrs:
        plt.plot(lrs, label=name)

plt.xlabel('Iteration')
plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 2)
losses_schedules = [
    ('Step', step_losses),
    ('Exponential', exp_losses),
    ('Cosine', cos_losses)
]

for name, schedule_losses in losses_schedules:
    if schedule_losses:
        plt.plot(schedule_losses, label=name)

plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Loss with Different Schedules')
plt.legend()
plt.yscale('log')
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 3)
traj_schedules = [
    ('Step', step_trajectory),
    ('Exponential', exp_trajectory),
    ('Cosine', cos_trajectory)
]

for name, schedule_trajectory in traj_schedules:
    if schedule_trajectory:
        traj_array = np.array(schedule_trajectory)
        plt.plot(traj_array[:, 0], traj_array[:, 1], label=name)

plt.plot(1.0, 1.0, 'r*', markersize=15, label='Global Minimum')
plt.xlabel('x[0]')
plt.ylabel('x[1]')
plt.title('Optimization Paths with Schedules')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Test your learning rate schedule implementations
try:
    test_learning_rate_schedules(locals())
    print("✅ Section 6: Learning Rate Schedules - All tests passed!")
except Exception as e:
    print(f"❌ Section 6: Learning Rate Schedules - Tests failed: {e}")
    print("Please complete the learning rate schedule implementations above before proceeding.")

## Final Validation

Run the complete test suite to validate all your mathematical implementations.

In [None]:
# Run complete validation
print("Running complete test suite...\n")

all_tests_passed = True
test_sections = [
    ("NumPy Gradient Descent", test_numpy_gradient_descent),
    ("PyTorch Gradient Descent", test_torch_gradient_descent),
    ("Optimization Visualization", test_visualization),
    ("SGD with Momentum", test_sgd_momentum),
    ("Adaptive Optimizers", test_adaptive_optimizers),
    ("Learning Rate Schedules", test_learning_rate_schedules)
]

for section_name, test_func in test_sections:
    try:
        test_func(locals())
        print(f"✅ {section_name} - PASSED")
    except Exception as e:
        print(f"❌ {section_name} - FAILED: {e}")
        all_tests_passed = False

print("\n" + "="*50)
if all_tests_passed:
    print("🎉 ALL TESTS PASSED! You have successfully completed Exercise 2.")
    print("You are now ready to proceed to Exercise 3: Tensor Mastery.")
else:
    print("❌ Some tests failed. Please review the failed sections and complete the missing implementations.")
print("="*50)

## Summary

In this exercise, you have implemented core mathematical foundations for optimization:

1. **Gradient Descent from Scratch**: Understanding the fundamental mathematics behind optimization
2. **NumPy vs PyTorch Implementation**: Comparing different computational backends
3. **Optimization Visualization**: Seeing how algorithms navigate the loss landscape
4. **Momentum Methods**: Accelerating convergence with momentum
5. **Adaptive Learning Rates**: AdaGrad and RMSprop for automatic learning rate adjustment
6. **Learning Rate Schedules**: Dynamic learning rate adjustment strategies

These mathematical implementations provide the foundation for understanding how modern deep learning optimizers work under the hood. You've gained practical experience with:

- Manual gradient computation and parameter updates
- Different optimization strategies and their trade-offs
- Convergence analysis and visualization techniques
- The mathematical principles underlying popular optimization algorithms

This knowledge will be crucial when working with PyTorch's built-in optimizers and understanding their behavior in neural network training.