# Gradient Descent Basics

This notebook demonstrates the fundamentals of gradient descent optimization.

## Contents
1. Understanding Gradient Descent
2. Simple Example: f(x,y) = x² + y²
3. Visualizing the Optimization Path
4. Effect of Learning Rate
5. Interactive Demo
6. Try Rosenbrock Function

In [None]:
import sys
sys.path.append('../src')

import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider, IntSlider
import warnings
warnings.filterwarnings('ignore')

from gradient_descent import (
    gradient_descent,
    quadratic_function,
    quadratic_gradient,
    rosenbrock_function,
    rosenbrock_gradient
)
from visualizations import (
    plot_contour_with_path,
    plot_3d_surface_with_path,
    plot_gradient_vectors,
    plot_learning_rate_comparison
)

%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 8)

## 1. Understanding Gradient Descent

Gradient descent is an iterative optimization algorithm that searches for a local minimum of a differentiable function by repeatedly stepping against its gradient.

**Algorithm:**
```
Initialize: x₀ (starting point)
Repeat until convergence:
    1. Compute gradient: ∇f(xₜ)
    2. Update: xₜ₊₁ = xₜ - α * ∇f(xₜ)
```

Where:
- α (alpha) is the learning rate, a positive step size
- ∇f(xₜ) is the gradient at point xₜ
- The negative sign indicates we move in the opposite direction of the gradient (downhill)
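
The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the `gradient_descent` function imported from `../src`; the name `simple_gradient_descent`, its return values, and its stopping rule (gradient norm below `tol`) are assumptions made for this example.

```python
import numpy as np

def simple_gradient_descent(grad_f, x0, alpha=0.1, max_iter=100, tol=1e-8):
    """Hypothetical helper: iterate x <- x - alpha * grad_f(x)."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(max_iter):
        g = grad_f(x)
        # Assumed stopping rule: stop once the gradient is nearly zero.
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * g  # step against the gradient (downhill)
        path.append(x.copy())
    return x, path

# f(x, y) = x^2 + y^2 has gradient (2x, 2y).
x_min, path_demo = simple_gradient_descent(lambda x: 2 * x, [4.0, 3.0])
print("final point:", x_min)
print("updates performed:", len(path_demo) - 1)
```

Storing every iterate in `path` is what makes the later contour and surface plots possible; a version meant only for optimization could drop it.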

## 2. Simple Example: f(x,y) = x² + y²

Let's start with a simple quadratic function that has a global minimum at (0, 0).
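
For this function the gradient is ∇f(x, y) = (2x, 2y), which is zero only at the origin. Before running an optimizer it can be worth checking an analytic gradient against a finite-difference estimate. The sketch below defines its own `quad` and `quad_grad` instead of using the imported `quadratic_function` and `quadratic_gradient`, whose exact signatures are not shown in this notebook; `numerical_gradient` is a hypothetical helper.

```python
import numpy as np

def quad(p):
    """f(x, y) = x^2 + y^2 for a point p = [x, y]."""
    return p[0] ** 2 + p[1] ** 2

def quad_grad(p):
    """Analytic gradient (2x, 2y)."""
    return np.array([2.0 * p[0], 2.0 * p[1]])

def numerical_gradient(func, p, h=1e-6):
    """Central-difference estimate of the gradient of func at p."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        grad[i] = (func(p + step) - func(p - step)) / (2 * h)
    return grad

point = np.array([4.0, 3.0])
print("analytic: ", quad_grad(point))
print("numerical:", numerical_gradient(quad, point))
```

If the two printed vectors disagree by more than rounding error, the analytic gradient is the first place to look for a bug.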

In [None]:
# Define starting point
x0 = np.array([4.0, 3.0])

# Run gradient descent
x_opt, path, cost_history = gradient_descent(
    f=quadratic_function,
    grad_f=quadratic_gradient,
    x0=x0,
    alpha=0.1,
    max_iter=50,
    verbose=True
)

print(f"\nStarting point: {x0}")
print(f"Optimal point: {x_opt}")
print(f"Final cost: {quadratic_function(x_opt):.8f}")
print(f"Number of iterations: {len(path) - 1}")

## 3. Visualizing the Optimization Path

In [None]:
# Visualize on contour plot
plot_contour_with_path(
    quadratic_function,
    path,
    x_range=(-5, 5),
    y_range=(-5, 5),
    title="Gradient Descent on f(x,y) = x² + y²"
)

In [None]:
# Visualize on 3D surface
plot_3d_surface_with_path(
    quadratic_function,
    path,
    x_range=(-5, 5),
    y_range=(-5, 5),
    title="Gradient Descent on 3D Surface"
)

In [None]:
# Visualize gradient vectors
plot_gradient_vectors(
    quadratic_function,
    quadratic_gradient,
    x_range=(-5, 5),
    y_range=(-5, 5),
    grid_density=15,
    title="Gradient Vector Field (Descent Directions)"
)

In [None]:
# Plot cost vs iteration
plt.figure(figsize=(10, 6))
plt.plot(cost_history, 'b-', linewidth=2, marker='o', markersize=4)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Cost f(x,y)', fontsize=12)
plt.title('Cost Function Convergence', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

## 4. Effect of Learning Rate

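The learning rate (α) controls the step size and largely determines whether the iteration converges:
- **Too small**: convergence is slow, and the run may reach `max_iter` before getting close to the minimum
- **Too large**: the iterates overshoot the minimum and oscillate or diverge
- **Just right**: fast and stable convergence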
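
For f(x, y) = x² + y² these regimes can be made precise: the update reduces to xₜ₊₁ = (1 - 2α) xₜ, so the distance to the minimum is multiplied by |1 - 2α| at every step and the iteration converges only when 0 < α < 1. The sketch below checks this numerically. The helper `distance_after` is hypothetical, and the rates 0.9 and 1.1 are added here purely to show oscillation and divergence; they are not among the rates compared in the next cell.

```python
import numpy as np

def distance_after(alpha, steps, start=(4.0, 3.0)):
    """Distance to the origin after `steps` updates on f(x, y) = x^2 + y^2."""
    x = np.array(start, dtype=float)
    for _ in range(steps):
        x = x - alpha * 2.0 * x  # gradient of x^2 + y^2 is 2x
    return np.linalg.norm(x)

steps = 20
for alpha_demo in [0.01, 0.1, 0.3, 0.5, 0.9, 1.1]:
    factor = 1.0 - 2.0 * alpha_demo  # per-step contraction factor
    dist = distance_after(alpha_demo, steps)
    print(f"alpha={alpha_demo:.2f}  factor={factor:+.2f}  "
          f"distance after {steps} steps={dist:.3e}")
```

A negative factor means the iterate jumps to the other side of the minimum on every step; it still converges as long as the factor's magnitude stays below 1.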

In [None]:
# Compare different learning rates
learning_rates = [0.01, 0.1, 0.3, 0.5]

plot_learning_rate_comparison(
    f=quadratic_function,
    grad_f=quadratic_gradient,
    x0=np.array([4.0, 3.0]),
    learning_rates=learning_rates,
    max_iter=50
)

## 5. Interactive Demo

Use the sliders to change the starting point, learning rate, and iteration limit, and watch how the path and the cost curve respond.

In [None]:
def interactive_gradient_descent(start_x=4.0, start_y=3.0, alpha=0.1, max_iter=50):
    """Interactive gradient descent visualization"""
    x0 = np.array([start_x, start_y])
    
    # Run gradient descent
    x_opt, path, cost_history = gradient_descent(
        quadratic_function,
        quadratic_gradient,
        x0,
        alpha=alpha,
        max_iter=max_iter,
        tol=1e-10
    )
    
    # Create visualization
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Left plot: Contour with path
    x = np.linspace(-6, 6, 200)
    y = np.linspace(-6, 6, 200)
    X, Y = np.meshgrid(x, y)
    Z = X**2 + Y**2
    
    ax1.contour(X, Y, Z, levels=30, cmap='viridis', alpha=0.6)
    ax1.contourf(X, Y, Z, levels=30, cmap='viridis', alpha=0.3)
    
    path_array = np.array(path)
    ax1.plot(path_array[:, 0], path_array[:, 1], 'r.-', linewidth=2, markersize=8)
    ax1.plot(path_array[0, 0], path_array[0, 1], 'go', markersize=15, label='Start')
    ax1.plot(path_array[-1, 0], path_array[-1, 1], 'r*', markersize=20, label='End')
    
    ax1.set_xlabel('x', fontsize=12)
    ax1.set_ylabel('y', fontsize=12)
    ax1.set_title(f'Optimization Path ({len(path)-1} iterations)', fontsize=13, fontweight='bold')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Right plot: Cost history
    ax2.plot(cost_history, 'b-', linewidth=2, marker='o', markersize=5)
    ax2.set_xlabel('Iteration', fontsize=12)
    ax2.set_ylabel('Cost', fontsize=12)
    ax2.set_title('Cost Convergence', fontsize=13, fontweight='bold')
    ax2.grid(True, alpha=0.3)
    ax2.set_yscale('log')
    
    # Add statistics
    stats_text = f'Final point: ({x_opt[0]:.4f}, {x_opt[1]:.4f})\n'
    stats_text += f'Final cost: {quadratic_function(x_opt):.6f}\n'
    stats_text += f'Iterations: {len(path)-1}'
    
    ax2.text(0.05, 0.95, stats_text, transform=ax2.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
    
    plt.tight_layout()
    plt.show()

# Create interactive widget
interact(
    interactive_gradient_descent,
    start_x=FloatSlider(min=-5, max=5, step=0.5, value=4.0, description='Start X:'),
    start_y=FloatSlider(min=-5, max=5, step=0.5, value=3.0, description='Start Y:'),
    alpha=FloatSlider(min=0.01, max=0.5, step=0.01, value=0.1, description='Learning Rate:'),
    max_iter=IntSlider(min=10, max=100, step=5, value=50, description='Max Iterations:')
);

## 6. Try Rosenbrock Function

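The Rosenbrock function is a harder test problem: its minimum lies in a narrow, curved ("banana-shaped") valley, so gradient descent needs a small learning rate and makes slow progress along the valley floor.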
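
To make the shape concrete, here is a sketch of the function f(x, y) = (1 - x)² + 100(y - x²)² and its gradient, obtained by differentiating that formula. The names `rosenbrock` and `rosenbrock_grad` are local to this example; the imported `rosenbrock_function` and `rosenbrock_gradient` may be implemented differently.

```python
import numpy as np

def rosenbrock(p):
    """f(x, y) = (1 - x)^2 + 100 (y - x^2)^2 for a point p = [x, y]."""
    x, y = p
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

def rosenbrock_grad(p):
    """Gradient of the formula above with respect to x and y."""
    x, y = p
    dfdx = -2.0 * (1.0 - x) - 400.0 * x * (y - x ** 2)
    dfdy = 200.0 * (y - x ** 2)
    return np.array([dfdx, dfdy])

minimum = np.array([1.0, 1.0])
origin = np.array([0.0, 0.0])
print("f at (1, 1):   ", rosenbrock(minimum))
print("grad at (1, 1):", rosenbrock_grad(minimum))
print("grad at (0, 0):", rosenbrock_grad(origin))
```

The factors of 400 and 200 in the gradient come from the steep valley walls; they are the reason a learning rate that works for the quadratic example is likely to be too large here.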

In [None]:
# Rosenbrock function: f(x,y) = (1-x)² + 100(y-x²)²
# Global minimum at (1, 1)

x0 = np.array([0.0, 0.0])

x_opt, path, cost_history = gradient_descent(
    rosenbrock_function,
    rosenbrock_gradient,
    x0,
    alpha=0.001,
    max_iter=500,
    verbose=True
)

print(f"\nOptimal point: {x_opt}")
print("True minimum: [1.0, 1.0]")
print(f"Final cost: {rosenbrock_function(x_opt):.8f}")

In [None]:
# Visualize Rosenbrock optimization
plot_contour_with_path(
    rosenbrock_function,
    path,
    x_range=(-2, 2),
    y_range=(-1, 3),
    title="Gradient Descent on Rosenbrock Function",
    levels=50
)

## Key Takeaways

1. **Gradient descent steps against the gradient**, the direction of steepest local descent
2. **Learning rate is critical** - needs to be tuned for each problem
3. **Different functions have different convergence properties**
4. **Visualization helps understand the optimization process**
5. **Convergence speed depends on function landscape and parameters**