# TP1: Optimization of Machine Learning Problems

**Day 1 - AI for Sciences Winter School**

**Instructor:** Raphael Cousin

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/racousin/ai_for_sciences/blob/main/day1/tp1.ipynb)

## Objectives
1. Understand gradient descent intuitively
2. Apply gradient descent to linear regression with PyTorch
3. Compare CPU vs GPU training performance

## Setup

Run the cell below to install and import the required packages.

In [None]:
# Install the aiforscience package from GitHub
!pip install -q git+https://github.com/racousin/ai_for_sciences.git

import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

from aiforscience import (
    plot_gradient_descent_1d,
    plot_loss_history,
    plot_predictions,
    plot_gradient_step,
    print_model_params,
    print_training_step,
    print_gradient_info,
    print_device_comparison,
    generate_linear_data,
)

print("Setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

---
# Part 1: Understanding Gradient Descent

**Goal:** Find the value of $\theta$ that minimizes a function $f(\theta)$.

## The Key Idea

Gradient descent is an iterative optimization algorithm:

$$\theta_{new} = \theta_{old} - \eta \cdot \nabla f(\theta_{old})$$

Where:
- $\theta$ is the parameter we want to optimize
- $\eta$ (eta) is the **learning rate** (step size)
- $\nabla f(\theta)$ is the **gradient** of the function (in one dimension, simply the derivative $f'(\theta)$)
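
As a quick sanity check on what the gradient means, the sketch below estimates a derivative numerically with a central difference and compares it with the exact value. The function `g` and the helper `numerical_gradient` are illustrative names that do not appear elsewhere in this notebook.

```python
def numerical_gradient(func, theta, h=1e-5):
    """Central-difference estimate of the derivative of func at theta."""
    return (func(theta + h) - func(theta - h)) / (2 * h)

def g(theta):
    """Illustrative function (an assumption for this sketch): g(theta) = theta^2."""
    return theta ** 2

theta = 1.5
approx = numerical_gradient(g, theta)
exact = 2 * theta  # analytical derivative of theta^2

print(f"numerical: {approx:.6f}, exact: {exact:.6f}")
```

If the two values disagree noticeably for your own function, the hand-written gradient is the first thing to re-check.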

## Example Function

Let's minimize $f(\theta) = (3\theta - 7)^2$

**Question:** What is the analytical minimum of this function?
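
One way to answer this, sketched with the chain rule: set the derivative to zero and solve for $\theta$.

$$f'(\theta) = 2 \cdot 3 \cdot (3\theta - 7) = 6(3\theta - 7) = 0 \quad \Longrightarrow \quad \theta^* = \frac{7}{3}, \qquad f(\theta^*) = 0$$

Because $f$ is a square, it can never be negative, so $\theta^* = 7/3$ is the global minimum.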

In [None]:
# Define the function and its gradient
def f(theta):
    """Function to minimize: (3*theta - 7)^2"""
    return (3 * theta - 7) ** 2

def gradient_f(theta):
    """Gradient of f: d/d_theta [(3*theta - 7)^2] = 2*(3*theta - 7)*3 = 6*(3*theta - 7)"""
    return 6 * (3 * theta - 7)

# Visualize the function
theta_values = np.linspace(-2, 5, 100)
plt.figure(figsize=(8, 4))
plt.plot(theta_values, [f(t) for t in theta_values], 'b-', linewidth=2)
plt.xlabel('θ')
plt.ylabel('f(θ)')
plt.title('f(θ) = (3θ - 7)²')
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
plt.show()

print(f"Analytical minimum: θ = 7/3 ≈ {7/3:.4f}")

## Gradient Descent Step by Step

Let's watch one gradient descent step in detail:

In [None]:
# Start at theta = 0
theta = 0.0
learning_rate = 0.05

# Compute gradient at current position
grad = gradient_f(theta)
print(f"Current θ = {theta}")
print(f"Current f(θ) = {f(theta)}")
print(f"Gradient at θ: {grad}")
print("\nUpdate: θ_new = θ - lr × gradient")
print(f"        θ_new = {theta} - {learning_rate} × {grad}")
print(f"        θ_new = {theta - learning_rate * grad}")

# Perform update
theta_new = theta - learning_rate * grad

# Visualize the step
plot_gradient_step(theta, theta_new, grad, learning_rate, f, theta_range=(-2, 5))
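
For reference, the arithmetic of this single step, written out by hand with the starting point $\theta = 0$ and learning rate $\eta = 0.05$ used in the cell above:

$$\nabla f(0) = 6(3 \cdot 0 - 7) = -42, \qquad \theta_{new} = 0 - 0.05 \cdot (-42) = 2.1$$

The function value drops from $f(0) = 49$ to $f(2.1) = (6.3 - 7)^2 = 0.49$ in one step.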

## Complete Gradient Descent

Now let's run multiple iterations:

In [None]:
def gradient_descent(f, gradient_f, theta_init, learning_rate, n_iterations):
    """
    Perform gradient descent optimization.
    
    Args:
        f: Function to minimize
        gradient_f: Gradient of f
        theta_init: Starting value
        learning_rate: Step size
        n_iterations: Number of steps
    
    Returns:
        theta_history: List of theta values at each step
    """
    theta = theta_init
    theta_history = [theta]
    
    for i in range(n_iterations):
        grad = gradient_f(theta)
        theta = theta - learning_rate * grad
        theta_history.append(theta)
        
    return theta_history

# Run gradient descent
theta_history = gradient_descent(
    f=f,
    gradient_f=gradient_f,
    theta_init=0.0,
    learning_rate=0.05,
    n_iterations=20
)

print(f"Starting θ: {theta_history[0]:.4f}")
print(f"Final θ: {theta_history[-1]:.4f}")
print(f"True minimum: {7/3:.4f}")

# Visualize
plot_gradient_descent_1d(f, theta_history, theta_range=(-2, 5))

---
## Exercise 1: Experiment with Learning Rate

Try different learning rates and observe the behavior:
- `learning_rate = 0.01` (small)
- `learning_rate = 0.05` (medium)
- `learning_rate = 0.15` (large)
- `learning_rate = 0.35` (too large?)

**Questions:**
1. What happens with a very small learning rate?
2. What happens with a very large learning rate?
3. What is a good learning rate for this problem?
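
A short calculation can guide these experiments. For this particular function the gradient is $6(3\theta - 7) = 18(\theta - 7/3)$, so each update multiplies the distance to the minimum by $(1 - 18\eta)$. The sketch below prints that factor for the four suggested learning rates; `error_factor` is a hypothetical helper, and the formula holds only for this $f$.

```python
def error_factor(lr):
    """Per-step multiplier of the error (theta - 7/3) for f = (3*theta - 7)^2."""
    return 1 - 18 * lr

for lr in [0.01, 0.05, 0.15, 0.35]:
    factor = error_factor(lr)
    # The error shrinks only if the factor has magnitude below 1
    behaviour = "converges" if abs(factor) < 1 else "diverges"
    print(f"lr = {lr:.2f}  factor = {factor:+.2f}  -> {behaviour}")
```

Under this analysis the iterates converge only for $0 < \eta < 1/9 \approx 0.111$, and a negative factor means the iterates jump from one side of the minimum to the other.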

In [None]:
# TODO: Try different learning rates
learning_rate = 0.05  # <-- Modify this value!

theta_history = gradient_descent(
    f=f,
    gradient_f=gradient_f,
    theta_init=0.0,
    learning_rate=learning_rate,
    n_iterations=20
)

print(f"Learning rate: {learning_rate}")
print(f"Final θ: {theta_history[-1]:.4f} (target: {7/3:.4f})")
plot_gradient_descent_1d(f, theta_history, theta_range=(-2, 5))

---
## Exercise 2: Try a Different Function

Modify the function to minimize. Try: $f(\theta) = \theta^2 + 5\theta + 6$

**Hint:** The gradient is $\nabla f(\theta) = 2\theta + 5$

In [None]:
# TODO: Define a new function and its gradient
def f2(theta):
    return theta**2 + 5*theta + 6  # Modify this!

def gradient_f2(theta):
    return 2*theta + 5  # Modify this!

# What's the analytical minimum? (hint: set gradient to 0)
analytical_min = -5/2  # theta where gradient = 0
print(f"Analytical minimum: θ = {analytical_min}")

# Run gradient descent
theta_history = gradient_descent(
    f=f2,
    gradient_f=gradient_f2,
    theta_init=5.0,
    learning_rate=0.1,
    n_iterations=30
)

print(f"Final θ: {theta_history[-1]:.4f}")
plot_gradient_descent_1d(f2, theta_history, theta_range=(-6, 6))

---
# Part 2: Linear Regression with PyTorch

Now we apply gradient descent to a real machine learning problem: **linear regression**.

**Goal:** Find weights $w$ and bias $b$ such that $\hat{y} = Xw + b$ minimizes the Mean Squared Error (MSE):

$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
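
Before handing the work to PyTorch, it can help to see the same optimization done by hand. For one feature, the MSE gradients are $\partial\mathcal{L}/\partial w = \frac{2}{n}\sum_i (\hat{y}_i - y_i)x_i$ and $\partial\mathcal{L}/\partial b = \frac{2}{n}\sum_i (\hat{y}_i - y_i)$. The NumPy sketch below applies them to toy data; the line $y = 2x + 1$, the noise-free data, and the hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

# Toy data (an assumption for illustration): y = 2x + 1 with no noise
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 1.0

w, b = 0.0, 0.0   # initial parameters
lr = 0.1          # learning rate

for step in range(500):
    y_hat = w * x + b                      # forward pass
    residual = y_hat - y
    grad_w = 2.0 * np.mean(residual * x)   # dL/dw
    grad_b = 2.0 * np.mean(residual)       # dL/db
    w -= lr * grad_w                       # gradient descent update
    b -= lr * grad_b

mse = np.mean((w * x + b - y) ** 2)
print(f"w = {w:.4f}, b = {b:.4f}, MSE = {mse:.2e}")
```

The PyTorch cells that follow perform the same steps, with `loss.backward()` computing the two gradients automatically.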

## Generate a Dataset

In [None]:
# Generate synthetic data
X, y, true_weights, true_bias = generate_linear_data(
    n_samples=100,
    n_features=1,
    noise=0.5,
    seed=42
)

# Visualize the data
plt.figure(figsize=(8, 5))
plt.scatter(X.numpy(), y.numpy(), alpha=0.6)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Generated Data for Linear Regression')
plt.grid(True, alpha=0.3)
plt.show()

## Create a Linear Model

In [None]:
# Create a simple linear model: y = Xw + b
model = nn.Linear(in_features=1, out_features=1)

# Show initial (random) parameters
print_model_params(model, "Initial Model Parameters")

# Make initial predictions
with torch.no_grad():
    y_pred_initial = model(X)
    initial_loss = nn.MSELoss()(y_pred_initial, y)

print(f"Initial Loss (MSE): {initial_loss.item():.4f}")

# Visualize initial predictions
plot_predictions(X.numpy(), y.numpy(), y_pred_initial.numpy(), 
                title="Initial Predictions (Before Training)")

## Define Loss and Optimizer

In [None]:
# Reset model
model = nn.Linear(in_features=1, out_features=1)

# Loss function: Mean Squared Error
loss_fn = nn.MSELoss()

# Optimizer: SGD. Each step below uses the full dataset, so this is
# plain (batch) gradient descent rather than a stochastic variant.
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

print("Loss function: MSE")
print(f"Optimizer: SGD with lr={learning_rate}")

## Training Step - Detailed View

Let's look at ONE training step in detail to understand what's happening:

In [None]:
print("="*60)
print(" DETAILED TRAINING STEP")
print("="*60)

# Step 0: Show current parameters
print("\n[BEFORE] Parameters:")
for name, param in model.named_parameters():
    print(f"  {name}: {param.data.numpy().flatten()}")

# Step 1: Forward pass - compute predictions
print("\n[STEP 1] Forward pass: y_pred = model(X)")
y_pred = model(X)
print(f"  Predictions shape: {y_pred.shape}")

# Step 2: Compute loss
print("\n[STEP 2] Compute loss: loss = MSE(y_pred, y)")
loss = loss_fn(y_pred, y)
print(f"  Loss value: {loss.item():.6f}")

# Step 3: Clear old gradients (otherwise PyTorch adds new gradients to them)
print("\n[STEP 3] Zero gradients: optimizer.zero_grad()")
optimizer.zero_grad()
print("  Gradients cleared (set to None by default in PyTorch >= 2.0)")

# Step 4: Backward pass - compute gradients
print("\n[STEP 4] Backward pass: loss.backward()")
loss.backward()
print("  Gradients computed:")
for name, param in model.named_parameters():
    print(f"    {name}.grad: {param.grad.numpy().flatten()}")

# Step 5: Update parameters
print("\n[STEP 5] Update parameters: optimizer.step()")
print("  Rule: param_new = param_old - lr * gradient")
optimizer.step()

# Show updated parameters
print("\n[AFTER] Parameters:")
for name, param in model.named_parameters():
    print(f"  {name}: {param.data.numpy().flatten()}")

print("\n" + "="*60)
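
Step 3 matters because PyTorch adds each newly computed gradient to whatever is already stored in `.grad`. A minimal sketch with a single scalar parameter (the tensor `w` is illustrative and not part of the model above):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()     # gradient of 3*w with respect to w

(3 * w).backward()        # no reset in between: the gradient is added
second = w.grad.item()

w.grad.zero_()            # manual reset, the job optimizer.zero_grad() does
(3 * w).backward()
third = w.grad.item()

print(first, second, third)  # 3.0 6.0 3.0
```

Forgetting the reset in a training loop would make every update use the sum of all past gradients rather than the current one.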

## Full Training Loop

In [None]:
# Reset model and optimizer
model = nn.Linear(in_features=1, out_features=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Training parameters
n_epochs = 100
losses = []

# Training loop
for epoch in range(n_epochs):
    # Forward pass
    y_pred = model(X)
    loss = loss_fn(y_pred, y)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    losses.append(loss.item())
    
    # Print every 20 epochs
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1:3d}: Loss = {loss.item():.6f}")

# Final results
print("\n" + "="*40)
print(" TRAINING COMPLETE")
print("="*40)
print(f"Initial loss: {losses[0]:.4f}")
print(f"Final loss:   {losses[-1]:.4f}")

print("\nLearned parameters:")
for name, param in model.named_parameters():
    print(f"  {name}: {param.data.numpy().flatten()}")
print("\nTrue parameters:")
print(f"  weight: {true_weights.flatten()}")
print(f"  bias: {true_bias}")

In [None]:
# Visualize training progress
plot_loss_history(losses, title="Training Loss over Epochs")

# Visualize final predictions
with torch.no_grad():
    y_pred_final = model(X)
plot_predictions(X.numpy(), y.numpy(), y_pred_final.numpy(), 
                title="Final Predictions (After Training)")

---
## Exercise 3: Experiment with Training Parameters

Modify the code below to experiment with:
1. **Learning rate:** Try `0.01`, `0.1`, `0.5`, `1.0`
2. **Number of epochs:** Try `10`, `50`, `100`, `500`

**Questions:**
- What happens if the learning rate is too high?
- How many epochs are needed to converge?
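
One rough way to put a number on "converged" is to find the first epoch at which the loss stops changing by more than a small tolerance. The helper below is a sketch: `epochs_to_converge` is a hypothetical name, the tolerance is arbitrary, and the example losses are made up for illustration.

```python
def epochs_to_converge(losses, tol=1e-4):
    """Return the first epoch whose loss differs from the previous one by less than tol.

    Returns None if the loss never settles within the recorded history.
    """
    for epoch in range(1, len(losses)):
        if abs(losses[epoch - 1] - losses[epoch]) < tol:
            return epoch
    return None

# Made-up loss curve for illustration
example_losses = [4.0, 2.0, 1.0, 0.5, 0.49995, 0.49994]
print(epochs_to_converge(example_losses))  # 4
```

After running the cell below, calling the helper on its `losses` list gives a comparable number for each learning rate you try.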

In [None]:
# TODO: Modify these parameters
learning_rate = 0.1  # <-- Try different values!
n_epochs = 100       # <-- Try different values!

# Reset model
model = nn.Linear(in_features=1, out_features=1)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
losses = []

# Training loop
for epoch in range(n_epochs):
    y_pred = model(X)
    loss = loss_fn(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"Learning rate: {learning_rate}, Epochs: {n_epochs}")
print(f"Final loss: {losses[-1]:.6f}")
plot_loss_history(losses)

---
## Exercise 4: Modify the Dataset

Try different dataset configurations:
- More samples: `n_samples=500`
- More noise: `noise=2.0`
- More features: `n_features=3`
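
When you raise the noise, expect the final loss to stop improving at a higher value: with additive Gaussian noise of standard deviation $\sigma$, even the best possible line leaves an MSE of about $\sigma^2$. The NumPy sketch below illustrates this with its own toy generator. It assumes that `noise` in `generate_linear_data` plays the role of $\sigma$, which this notebook does not confirm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, sigma = 5000, 2.0   # sigma stands in for `noise` (assumption)

x = rng.normal(size=n_samples)
y = 3.0 * x - 1.0 + sigma * rng.normal(size=n_samples)   # hypothetical true line

# Best-fit line by ordinary least squares
A = np.column_stack([x, np.ones(n_samples)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

mse = np.mean((A @ np.array([w, b]) - y) ** 2)
print(f"fitted w = {w:.3f}, b = {b:.3f}")
print(f"MSE of best fit = {mse:.3f}  (sigma**2 = {sigma**2:.1f})")
```

A final training loss close to this floor suggests the model has converged, not that training has failed.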

In [None]:
# TODO: Modify dataset parameters
X_new, y_new, true_w, true_b = generate_linear_data(
    n_samples=100,   # <-- Try 500
    n_features=1,    # <-- Try 3
    noise=0.5,       # <-- Try 2.0
    seed=42
)

# Create and train model
n_features = X_new.shape[1]
model = nn.Linear(in_features=n_features, out_features=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
losses = []

for epoch in range(100):
    y_pred = model(X_new)
    loss = loss_fn(y_pred, y_new)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"Final loss: {losses[-1]:.6f}")
plot_loss_history(losses)

---
## Exercise 5: Build a Neural Network

Let's add more layers! A feed-forward neural network stacks several linear layers with **non-linear activation functions** between them; without the activations, the stack would collapse into a single linear map.
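
To see why the activations are needed, the NumPy sketch below checks that two linear layers applied back to back are equivalent to one linear layer. All shapes and random weights are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                      # 5 samples, 3 features
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=1)

# Two linear layers with no activation in between
stacked = (x @ W1 + b1) @ W2 + b2

# The same map written as ONE linear layer
W, b = W1 @ W2, b1 @ W2 + b2
single = x @ W + b

print(np.allclose(stacked, single))  # True

# With a ReLU in between, the two outputs generally differ
with_relu = np.maximum(x @ W1 + b1, 0.0) @ W2 + b2
print(np.allclose(with_relu, single))
```

The ReLU breaks the equivalence, which is what lets deeper networks represent functions that a single linear layer cannot.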

Architecture:
- Input layer: `n_features` neurons
- Hidden layer 1: 16 neurons + ReLU activation
- Hidden layer 2: 8 neurons + ReLU activation
- Output layer: 1 neuron
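
The number of trainable parameters follows directly from these layer sizes: each linear layer has `n_in * n_out` weights plus `n_out` biases. A small sketch (the helper `count_mlp_parameters` is a hypothetical name) for `n_features = 1`:

```python
def count_mlp_parameters(layer_sizes):
    """Weights plus biases for fully connected layers of the given sizes."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix + bias vector
    return total

# n_features = 1, then 16, 8, and 1 neurons, as in the architecture above
print(count_mlp_parameters([1, 16, 8, 1]))  # 177
```

The count printed by the next cell for `SimpleNN(n_features=1)` should match this figure.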

In [None]:
# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_features, 16),  # Hidden layer 1
            nn.ReLU(),                   # Activation
            nn.Linear(16, 8),            # Hidden layer 2
            nn.ReLU(),                   # Activation
            nn.Linear(8, 1)              # Output layer
        )
    
    def forward(self, x):
        return self.layers(x)

# Create model
nn_model = SimpleNN(n_features=1)

# Count parameters
total_params = sum(p.numel() for p in nn_model.parameters())
print("Neural Network Architecture:")
print(nn_model)
print(f"\nTotal parameters: {total_params}")

In [None]:
# Train the neural network
nn_model = SimpleNN(n_features=1)
optimizer = torch.optim.SGD(nn_model.parameters(), lr=0.01)
losses = []

for epoch in range(500):
    y_pred = nn_model(X)
    loss = loss_fn(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
    
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch+1}: Loss = {loss.item():.6f}")

plot_loss_history(losses, title="Neural Network Training Loss")

with torch.no_grad():
    y_pred_nn = nn_model(X)
plot_predictions(X.numpy(), y.numpy(), y_pred_nn.numpy(), 
                title="Neural Network Predictions")

---
# Part 3: CPU vs GPU Performance

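Training deep networks is dominated by large matrix multiplications, which GPUs execute in parallel across thousands of cores. The speedup grows with model and data size; for very small models, kernel-launch and data-transfer overhead can make the GPU no faster than the CPU. Let's compare training speed on CPU vs GPU.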

## Setting Up GPU in Colab

To use GPU in Google Colab:
1. Go to **Runtime** > **Change runtime type**
2. Select **GPU** as the Hardware accelerator
3. Click **Save**
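
Once the runtime has a GPU, a common pattern is to choose the device once and fall back to the CPU when none is found. This is a minimal sketch; the variable names are illustrative.

```python
import torch

# Use the GPU when one is visible to PyTorch, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.ones(3, device=device)   # tensor created directly on the chosen device
print(device, x.device)
```

Models and tensors must live on the same device before they can be combined, which is why the benchmark below moves both with `.to(device)`.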

In [None]:
# Check available devices
print("Device Information:")
print(f"  CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"  GPU name: {torch.cuda.get_device_name(0)}")
    print(f"  GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## Benchmark Function

We'll train the same model on CPU and GPU and compare the time.

In [None]:
import time

def train_model(X, y, device, n_epochs=1000, hidden_size=128, n_layers=4, verbose=True):
    """
    Train a neural network and measure the wall-clock training time.
    
    Args:
        X, y: Training data
        device: 'cpu' or 'cuda'
        n_epochs: Number of training epochs
        hidden_size: Neurons per hidden layer
        n_layers: Number of hidden layers
        verbose: Print progress
    
    Returns:
        (elapsed_time, final_loss): training time in seconds and last loss value
    """
    # Move data to device
    X_dev = X.to(device)
    y_dev = y.to(device)
    
    # Build model
    layers = []
    in_features = X.shape[1]
    for i in range(n_layers):
        layers.append(nn.Linear(in_features, hidden_size))
        layers.append(nn.ReLU())
        in_features = hidden_size
    layers.append(nn.Linear(in_features, 1))  # in_features equals hidden_size after the loop
    
    model = nn.Sequential(*layers).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    
    # Count parameters
    n_params = sum(p.numel() for p in model.parameters())
    if verbose:
        print(f"Model: {n_layers} hidden layers, {hidden_size} neurons each")
        print(f"Total parameters: {n_params:,}")
        print(f"Device: {device}")
    
    # Warm-up on GPU so one-time CUDA initialization is excluded from the timing
    if device == 'cuda':
        for _ in range(10):
            y_pred = model(X_dev)
            loss = loss_fn(y_pred, y_dev)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        torch.cuda.synchronize()
    
    # Timed training
    start_time = time.perf_counter()  # monotonic, high-resolution timer
    
    for epoch in range(n_epochs):
        y_pred = model(X_dev)
        loss = loss_fn(y_pred, y_dev)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    # Ensure GPU operations are complete
    if device == 'cuda':
        torch.cuda.synchronize()
    
    elapsed_time = time.perf_counter() - start_time
    
    if verbose:
        print(f"Training time: {elapsed_time:.4f} seconds")
        print(f"Final loss: {loss.item():.6f}")
    
    return elapsed_time, loss.item()
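
The `torch.cuda.synchronize()` calls in `train_model` matter because CUDA kernels are launched asynchronously: without them, the timer could stop before the GPU has finished its work. The same pattern can be wrapped in a small helper. Here `timed` is a hypothetical name, and the sketch falls back to the CPU when no GPU is present.

```python
import time
import torch

def timed(fn, device):
    """Run fn() and return (result, seconds), waiting for the GPU if needed."""
    if device == "cuda":
        torch.cuda.synchronize()       # finish any pending GPU work first
    start = time.perf_counter()
    result = fn()
    if device == "cuda":
        torch.cuda.synchronize()       # wait until fn's kernels have completed
    return result, time.perf_counter() - start

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(512, 512, device=device)
product, seconds = timed(lambda: a @ a, device)
print(f"{device}: 512x512 matmul took {seconds * 1e3:.3f} ms")
```

A single call like this is noisy; averaging over several repetitions after a warm-up run, as `train_model` does, gives steadier numbers.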

## Run the Benchmark

In [None]:
# Generate larger dataset for meaningful comparison
X_large, y_large, _, _ = generate_linear_data(
    n_samples=5000,
    n_features=20,
    noise=0.5,
    seed=42
)

print("\n" + "="*50)
print(" TRAINING ON CPU")
print("="*50)
cpu_time, cpu_loss = train_model(
    X_large, y_large, 
    device='cpu', 
    n_epochs=500,
    hidden_size=256,
    n_layers=4
)

if torch.cuda.is_available():
    print("\n" + "="*50)
    print(" TRAINING ON GPU")
    print("="*50)
    gpu_time, gpu_loss = train_model(
        X_large, y_large, 
        device='cuda', 
        n_epochs=500,
        hidden_size=256,
        n_layers=4
    )
    
    # Print comparison
    print_device_comparison(cpu_time, gpu_time)
else:
    print("\n" + "="*50)
    print(" GPU NOT AVAILABLE")
    print("="*50)
    print("To enable GPU in Colab:")
    print("  Runtime > Change runtime type > GPU")

---
## Exercise 6: Scale Up the Model

Try increasing the model size to see bigger GPU speedups:
- More data: `n_samples=10000`
- More features: `n_features=50`
- Bigger network: `hidden_size=512`, `n_layers=6`
- More epochs: `n_epochs=1000`
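
To get a feel for how quickly the model grows with these settings, the sketch below counts the parameters of the network that `train_model` builds: one input layer, `n_layers - 1` hidden-to-hidden layers, and one output layer. The helper `benchmark_mlp_parameters` is a hypothetical name.

```python
def benchmark_mlp_parameters(n_features, hidden_size, n_layers):
    """Parameter count of the MLP built in train_model (assumes n_layers >= 1)."""
    first = n_features * hidden_size + hidden_size                       # input -> hidden
    middle = (n_layers - 1) * (hidden_size * hidden_size + hidden_size)  # hidden -> hidden
    last = hidden_size + 1                                               # hidden -> output
    return first + middle + last

baseline = benchmark_mlp_parameters(20, 256, 4)
scaled = benchmark_mlp_parameters(50, 512, 6)
print(f"baseline: {baseline:,} parameters")
print(f"scaled:   {scaled:,} parameters")
```

The hidden-to-hidden term grows with the square of `hidden_size`, so widening the network increases the work per step much faster than adding input features does.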

In [None]:
# TODO: Modify these parameters to see larger speedups
n_samples = 5000      # <-- Try 10000, 20000
n_features = 20       # <-- Try 50, 100
hidden_size = 256     # <-- Try 512, 1024
n_layers = 4          # <-- Try 6, 8
n_epochs = 500        # <-- Try 1000

# Benchmark data (used for training only; there is no held-out test set here)
X_bench, y_bench, _, _ = generate_linear_data(
    n_samples=n_samples,
    n_features=n_features,
    noise=0.5,
    seed=42
)

print("\nCPU Training:")
cpu_time, _ = train_model(
    X_bench, y_bench, 'cpu', n_epochs=n_epochs,
    hidden_size=hidden_size, n_layers=n_layers
)

if torch.cuda.is_available():
    print("\nGPU Training:")
    gpu_time, _ = train_model(
        X_bench, y_bench, 'cuda', n_epochs=n_epochs,
        hidden_size=hidden_size, n_layers=n_layers
    )
    print_device_comparison(cpu_time, gpu_time)

---
# Summary

In this practical, you learned:

1. **Gradient Descent**: The core optimization algorithm
   - $\theta_{new} = \theta_{old} - \eta \cdot \nabla f(\theta_{old})$
   - Learning rate controls step size

2. **PyTorch Training Loop**:
   - Forward pass: `y_pred = model(X)`
   - Compute loss: `loss = loss_fn(y_pred, y)`
   - Zero gradients: `optimizer.zero_grad()`
   - Backward pass: `loss.backward()`
   - Update parameters: `optimizer.step()`

3. **CPU vs GPU**: GPUs can dramatically accelerate deep learning training

## Key Takeaways

- **Learning rate** is crucial: too small = slow, too large = unstable
- **Epochs** determine how long we train
- **GPUs** are essential for large-scale deep learning
- **Neural networks** are compositions of linear layers + activations