# 03: Calculus for Machine Learning

Welcome to the mathematical engine of machine learning! Today we'll explore how calculus powers the learning process in ML models through optimization and gradient-based methods.

> 💡 **Companion Reading**: This notebook pairs with [03_calculus_for_ml.md](03_calculus_for_ml.md) for deeper mathematical insights, analogies, and tutor guidance.

## 🎯 Objectives
- Understand derivatives and gradients as tools for measuring change
- Apply the chain rule to nested (composite) functions (essential for neural networks)
- Visualize and implement gradient descent optimization
- Connect calculus concepts to real machine learning applications
- Build intuition for how models "learn" through mathematical optimization

## 📐 Derivatives & Cost Functions

The derivative tells us how fast a function is changing at any point. In machine learning, we use derivatives to find the minimum of cost functions - the point where our model performs best.

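**Key Insight**: At a minimum of a smooth function, the derivative is zero! The reverse is not guaranteed (maxima and saddle points also have zero derivative), but for a bowl-shaped cost like the quadratic below, a zero derivative marks the minimum.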


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Define a quadratic cost function (like mean squared error)
def f(x): return (x - 3)**2
def df(x): return 2 * (x - 3)

x = np.linspace(-1, 7, 100)
y = f(x)
dy = df(x)

plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
plt.plot(x, y, 'b-', linewidth=2, label='f(x) = (x-3)²')
plt.axvline(3, color='red', linestyle=':', label='Minimum at x=3')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title("Cost Function")
plt.legend()
plt.grid(True)

plt.subplot(1, 2, 2)
plt.plot(x, dy, 'r--', linewidth=2, label="f'(x) = 2(x-3)")
plt.axhline(0, color='gray', linestyle='-', alpha=0.5)
plt.axvline(3, color='red', linestyle=':', label='Derivative = 0')
plt.xlabel('x')
plt.ylabel("f'(x)")
plt.title("Derivative (Slope)")
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

print("Key Observations:")
print("- The function has its minimum at x = 3")
print("- The derivative equals zero at the minimum")
print("- Negative derivative means function is decreasing")
print("- Positive derivative means function is increasing")
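
A quick way to build trust in a hand-derived derivative is to compare it with a numerical estimate. The sketch below assumes the same `f` and `df` as the cell above and adds a hypothetical helper, `central_difference`, that is not part of the original notebook. It approximates the slope by evaluating `f` slightly to the left and right of a point.

```python
def f(x):
    """Quadratic cost from the cell above."""
    return (x - 3) ** 2

def df(x):
    """Hand-derived derivative of f."""
    return 2 * (x - 3)

def central_difference(func, x, h=1e-5):
    """Hypothetical helper: estimate the slope of func at x from two nearby points."""
    return (func(x + h) - func(x - h)) / (2 * h)

# Compare the numerical estimate with the analytic derivative at a few points
for x0 in [0.0, 3.0, 5.0]:
    approx = central_difference(f, x0)
    exact = df(x0)
    print(f"x = {x0}: numerical = {approx:.6f}, analytic = {exact:.6f}")
```

If the two columns disagree noticeably, the analytic derivative is the first thing to re-check. The step size `h=1e-5` is an illustrative choice, not a tuned value.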

## 🔗 Chain Rule in Action

The chain rule is crucial for neural networks! It tells us how to find derivatives of nested functions - exactly what we need when signals pass through multiple layers.

**Formula**: If y = f(g(x)), then dy/dx = df/dg × dg/dx


In [None]:
# Example: f(g(x)) = (2x + 1)^2
# This is like a simple neural network: input → linear transformation → activation

def g(x):
    """Inner function: linear transformation"""
    return 2*x + 1

def f(g_val):
    """Outer function: square activation"""
    return g_val**2

def df_dg(g_val):
    """Derivative of outer function"""
    return 2 * g_val

def dg_dx(x):
    """Derivative of inner function"""
    return 2

# Chain rule in action
x_val = 2
g_val = g(x_val)
f_val = f(g_val)

# Manual chain rule calculation
df_dx_manual = df_dg(g_val) * dg_dx(x_val)

print(f"At x = {x_val}:")
print(f"g(x) = 2x + 1 = {g_val}")
print(f"f(g(x)) = g² = {f_val}")
print(f"df/dg = 2g = {df_dg(g_val)}")
print(f"dg/dx = 2 = {dg_dx(x_val)}")
print(f"df/dx = df/dg × dg/dx = {df_dx_manual}")

# Verify with direct calculation
print(f"\nVerification: f(x) = (2x + 1)² = {(2*x_val + 1)**2}")
print(f"Direct derivative: df/dx = 4(2x + 1) = {4*(2*x_val + 1)}")
print("✅ Chain rule matches direct calculation!")
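
The same idea extends to longer chains, which is closer to what happens inside a network. The sketch below is an illustration under assumptions, not the notebook's own model: it keeps the linear step `2x + 1`, swaps the square for a sigmoid activation, and adds a squared-error loss against an assumed target of `1.0`. The names `forward` and `backward` are hypothetical. The chain-rule result is then checked against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w=2.0, b=1.0):
    """Three nested steps: linear -> sigmoid -> squared error (target 1.0 is assumed)."""
    z = w * x + b
    a = sigmoid(z)
    loss = (a - 1.0) ** 2
    return z, a, loss

def backward(x, w=2.0, b=1.0):
    """Chain rule: dloss/dx = dloss/da * da/dz * dz/dx."""
    z, a, loss = forward(x, w, b)
    dloss_da = 2 * (a - 1.0)   # derivative of the squared error
    da_dz = a * (1 - a)        # derivative of the sigmoid
    dz_dx = w                  # derivative of the linear step
    return dloss_da * da_dz * dz_dx

x0 = 0.5
h = 1e-6
numeric = (forward(x0 + h)[2] - forward(x0 - h)[2]) / (2 * h)
analytic = backward(x0)
print(f"Chain rule:        {analytic:.8f}")
print(f"Finite difference: {numeric:.8f}")
```

Each factor in `backward` is the local derivative of one step, and multiplying them from the loss back to the input is the pattern that backpropagation repeats layer by layer.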

## 🧗 Gradient Descent Demo

Gradient descent is the workhorse of machine learning! It uses derivatives to iteratively move toward the minimum of a function, which is the same idea neural networks use to learn.

**Algorithm**: 
1. Start at some point
2. Calculate the gradient (derivative)
3. Move in the opposite direction of the gradient
4. Repeat until convergence
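
The four steps above can be sketched as a small reusable function. This is a minimal version under stated assumptions: the function name `gradient_descent`, the tolerance `tol`, and the cap `max_iters` are illustrative choices, and "convergence" is taken to mean that the gradient magnitude has dropped below `tol`.

```python
def gradient_descent(grad, x0, learning_rate=0.1, tol=1e-8, max_iters=10_000):
    """Minimal 1-D gradient descent (illustrative sketch).

    Returns the final x and the number of update steps taken.
    """
    x = x0                              # 1. start at some point
    for step in range(1, max_iters + 1):
        g = grad(x)                     # 2. calculate the gradient
        if abs(g) < tol:                # 4. stop once the slope is (almost) flat
            return x, step - 1
        x = x - learning_rate * g       # 3. move opposite to the gradient
    return x, max_iters

# Usage on the cost f(x) = (x - 3)^2, whose gradient is 2(x - 3)
x_min, steps = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(f"x ≈ {x_min:.6f} after {steps} steps")
```

Stopping on a small gradient is only one possible criterion. Stopping when the cost stops improving, or after a fixed number of iterations, are common alternatives.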


In [None]:
# Enhanced gradient descent with visualization
def gradient_descent_demo():
    # Our cost function: f(x) = (x - 3)^2
    def cost_function(x):
        return (x - 3)**2

    def gradient(x):
        return 2 * (x - 3)

    # Different learning rates to show their effect
    learning_rates = [0.01, 0.1, 0.5]
    colors = ['blue', 'red', 'green']

    plt.figure(figsize=(15, 5))

    for i, lr in enumerate(learning_rates):
        plt.subplot(1, 3, i+1)

        # Initialize
        x = 0.0
        history_x = [x]
        history_cost = [cost_function(x)]

        # Run gradient descent
        for iteration in range(20):
            grad = gradient(x)
            x = x - lr * grad
            history_x.append(x)
            history_cost.append(cost_function(x))

        # Plot the cost function
        x_range = np.linspace(-1, 7, 100)
        y_range = cost_function(x_range)
        plt.plot(x_range, y_range, 'k-', alpha=0.3, label='Cost function')

        # Plot the path
        plt.plot(history_x, history_cost, 'o-', color=colors[i], 
                label=f'LR={lr}', markersize=4)
        plt.axhline(0, color='gray', linestyle='--', alpha=0.5)
        plt.axvline(3, color='gray', linestyle='--', alpha=0.5)

        plt.title(f'Learning Rate = {lr}')
        plt.xlabel('x')
        plt.ylabel('Cost')
        plt.legend()
        plt.grid(True)

        # Print final results
        print(f"Learning Rate {lr}: Final x = {history_x[-1]:.4f}, Final cost = {history_cost[-1]:.6f}")

    plt.tight_layout()
    plt.show()

    print("\nKey Observations:")
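    print("- Small learning rate (0.01): Slow convergence, still far from x = 3 after 20 steps")
    print("- Moderate learning rate (0.1): Smooth, steady convergence")
    print("- Learning rate 0.5: Reaches x = 3 in one step for this quadratic; larger rates overshoot and oscillate, and rates above 1.0 diverge")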

gradient_descent_demo()

# Show the mathematical intuition
print("\n🧠 Mathematical Intuition:")
print("- Gradient points uphill (direction of steepest increase)")
print("- We want to go downhill, so we move opposite to gradient")
print("- Learning rate controls step size")
print("- At minimum, gradient = 0, so we stop moving")
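
For this particular cost the effect of the learning rate can be worked out exactly. Substituting the gradient `2(x - 3)` into the update gives `x_new - 3 = (1 - 2*lr) * (x - 3)`, so every step multiplies the distance to the minimum by `1 - 2*lr`. The sketch below checks that reasoning numerically; `distance_after` is a hypothetical helper, and the rates `0.9` and `1.1` are added here purely for illustration.

```python
def distance_after(lr, steps=20, x0=0.0):
    """Run gradient descent on f(x) = (x - 3)^2 and return the final distance to x = 3."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * (x - 3)
    return abs(x - 3)

for lr in [0.01, 0.1, 0.5, 0.9, 1.1]:
    factor = 1 - 2 * lr   # per-step multiplier on the distance to the minimum
    print(f"lr = {lr:<4}: factor = {factor:+.2f}, distance after 20 steps = {distance_after(lr):.6f}")
```

A factor between 0 and 1 shrinks the distance smoothly, a negative factor means each step jumps past the minimum, and a magnitude above 1 means the distance grows. These thresholds are specific to this quadratic; other cost functions have different safe ranges.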

## ✅ Summary Quiz & Checklist

### Quiz Questions
1. **What does a derivative represent?**
   > A derivative represents the instantaneous rate of change of a function at a specific point. Geometrically, it's the slope of the tangent line to the function at that point.

2. **What does the gradient tell us in multiple dimensions?**
   > The gradient is a vector that points in the direction of steepest increase of a function. In machine learning, we use the negative gradient to find the direction of steepest decrease (toward the minimum).

3. **Why is the learning rate important in gradient descent?**
   > The learning rate controls how large a step we take toward the minimum. Too small means slow convergence; too large means overshooting or even divergence.

4. **How does the chain rule apply to neural networks?**
   > Neural networks are compositions of functions (layers). The chain rule allows us to compute gradients through these compositions, enabling backpropagation - the algorithm that trains neural networks.
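
To make the multi-dimensional gradient from question 2 concrete, here is a small sketch with two variables. The cost function `(x - 3)^2 + 2(y + 1)^2` is an assumed example chosen for illustration, not something from the cells above. Its gradient is a vector with one partial derivative per variable, and stepping against that vector moves both coordinates toward the minimum at `(3, -1)`.

```python
import numpy as np

def cost(p):
    """Assumed two-variable cost with its minimum at (3, -1)."""
    x, y = p
    return (x - 3) ** 2 + 2 * (y + 1) ** 2

def grad(p):
    """Gradient vector: one partial derivative per variable."""
    x, y = p
    return np.array([2 * (x - 3), 4 * (y + 1)])

p = np.array([0.0, 0.0])
print("Gradient at start:", grad(p))   # points toward increasing cost

# Step against the gradient repeatedly
for _ in range(100):
    p = p - 0.1 * grad(p)

print("Final point:", np.round(p, 4))
print("Final cost:", cost(p))
```

The update line is the same as in one dimension. Only the shapes change: `p` and `grad(p)` are now vectors.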

### Self-Assessment Checklist
Check off each item as you master it:

- [ ] I understand how to compute and interpret a derivative
- [ ] I understand what a gradient is in multiple dimensions
- [ ] I can apply the chain rule to nested functions
- [ ] I can visualize and explain gradient descent
- [ ] I know how learning rate affects convergence
- [ ] I can connect derivatives to machine learning optimization
- [ ] I understand why the chain rule is crucial for neural networks

### 🔗 Next Steps
- Review the [companion theory file](03_calculus_for_ml.md) for deeper mathematical insights
- Practice computing derivatives of different functions
- Experiment with different learning rates in gradient descent
- Think about how these concepts apply to training neural networks

### 💡 Key Takeaways
- **Derivatives**: Measure how functions change (slopes)
- **Chain Rule**: Essential for computing gradients through function compositions
- **Gradient Descent**: Uses derivatives to find function minima
- **Learning Rate**: Critical hyperparameter that controls step size, and with it the speed and stability of optimization
- **ML Connection**: Most modern ML training, including neural network training, relies on gradient-based optimization!
