# Project 1: Linear Regression from First Principles

## Goal
Understand gradient descent, loss functions, and parameter updates at the most basic level.

## Learning Objectives
- What a loss function is and why we use MSE
- How gradient descent works mathematically
- What a learning rate is and how it affects convergence
- What an epoch means
- How to know when to stop training

## Date Started
November 8, 2025

---
## Part 1: Theoretical Foundation

### Linear Regression Model
The linear regression model predicts output $y$ from input $x$ using:

$$\hat{y} = w \cdot x + b$$

Where:
- $w$ = weight (slope)
- $b$ = bias (intercept)
- $\hat{y}$ = predicted value

### Loss Function: Mean Squared Error (MSE)
We measure how well our model fits the data using MSE:

$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

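**Why MSE?** Squaring the residuals penalizes large errors more heavily than small ones, and the loss is differentiable everywhere, so its gradients are well defined for gradient descent.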
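
To make the "larger errors cost more" point concrete, here is a minimal sketch. The helper `mse` and the toy arrays are made up for illustration and are not part of the project code. Both prediction sets have the same total absolute error (4), but MSE scores the single large miss as four times worse.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical targets (illustration only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Case A: four small errors of size 1 each
spread_out = y_true + np.array([1.0, 1.0, 1.0, 1.0])
# Case B: one large error of size 4, the rest perfect
one_big = y_true + np.array([4.0, 0.0, 0.0, 0.0])

mse_spread = mse(y_true, spread_out)  # (1 + 1 + 1 + 1) / 4 = 1.0
mse_big = mse(y_true, one_big)        # (16 + 0 + 0 + 0) / 4 = 4.0

print(f"MSE with four errors of 1: {mse_spread}")
print(f"MSE with one error of 4:   {mse_big}")
```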

### Gradient Descent
To minimize the loss, we repeatedly nudge each parameter in the direction opposite its gradient, which is the direction that reduces the loss:

$$w := w - \alpha \frac{\partial L}{\partial w}$$
$$b := b - \alpha \frac{\partial L}{\partial b}$$

Where:
- $\alpha$ = learning rate (step size)
- $\frac{\partial L}{\partial w}$ = gradient of loss with respect to weight
- $\frac{\partial L}{\partial b}$ = gradient of loss with respect to bias
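
A single update step can be traced by hand. The sketch below uses a made-up three-point dataset lying exactly on $y = 2x$ and the MSE gradient formulas from the next subsection. All variable names are illustrative assumptions, not part of the project code.

```python
import numpy as np

# Hypothetical toy data on the line y = 2x (ideal fit: w = 2, b = 0)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

w, b = 0.0, 0.0   # starting parameters
alpha = 0.1       # learning rate

def loss(w, b):
    """MSE of the line w*x + b on the toy data."""
    return np.mean((y - (w * x + b)) ** 2)

# Gradients of the MSE at the current parameters
error = y - (w * x + b)
dw = -2.0 * np.mean(x * error)
db = -2.0 * np.mean(error)

# One gradient descent step
loss_before = loss(w, b)
w_new = w - alpha * dw
b_new = b - alpha * db
loss_after = loss(w_new, b_new)

print(f"dw={dw:.4f}, db={db:.4f}")
print(f"w: {w} -> {w_new:.4f}, b: {b} -> {b_new:.4f}")
print(f"loss: {loss_before:.4f} -> {loss_after:.4f}")
```

Both gradients are negative here, so subtracting $\alpha$ times the gradient increases $w$ and $b$, and the loss drops after the step.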

### Gradients for Linear Regression
$$\frac{\partial L}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i(y_i - \hat{y}_i)$$
$$\frac{\partial L}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$
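
These formulas follow from applying the chain rule to each squared term with $\hat{y}_i = w x_i + b$. One way to sanity-check them is to compare against central finite differences of the loss. The sketch below does that on synthetic data; the function names (`mse_loss`, `analytic_grads`, `numeric_grads`) and the test point are assumptions made for illustration.

```python
import numpy as np

# Synthetic data, roughly y = 4 + 3x + noise (illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=50)
y = 4.0 + 3.0 * x + rng.normal(0, 1, size=50)

def mse_loss(w, b):
    """MSE of the line w*x + b on the synthetic data."""
    return np.mean((y - (w * x + b)) ** 2)

def analytic_grads(w, b):
    """Gradients from the closed-form expressions above."""
    error = y - (w * x + b)
    dw = -2.0 * np.mean(x * error)
    db = -2.0 * np.mean(error)
    return dw, db

def numeric_grads(w, b, eps=1e-6):
    """Central finite-difference approximation of the same gradients."""
    dw = (mse_loss(w + eps, b) - mse_loss(w - eps, b)) / (2 * eps)
    db = (mse_loss(w, b + eps) - mse_loss(w, b - eps)) / (2 * eps)
    return dw, db

w0, b0 = 0.5, -1.0  # arbitrary point at which to compare
dw_a, db_a = analytic_grads(w0, b0)
dw_n, db_n = numeric_grads(w0, b0)

print(f"dL/dw analytic={dw_a:.6f}, numeric={dw_n:.6f}")
print(f"dL/db analytic={db_a:.6f}, numeric={db_n:.6f}")
```

If the two columns disagree by more than rounding noise, the analytic gradient (or its sign) is likely wrong.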

---
## Part 2: Setup and Data Generation

In [None]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
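import sys
# Add the repository root to the import path so `utils` can be found
# (adjust this path to wherever the repo lives on your machine)
sys.path.append('/Users/mark/git/learning-ml-to-llm')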

from utils.data_generators import generate_linear_data
from utils.visualization import plot_loss_curve, plot_regression_line

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")

In [None]:
# Generate synthetic data
# True relationship: y = 4 + 3x + noise

TRUE_SLOPE = 3.0
TRUE_INTERCEPT = 4.0
N_SAMPLES = 100
NOISE_STD = 1.0

X, y = generate_linear_data(
    n_samples=N_SAMPLES,
    slope=TRUE_SLOPE,
    intercept=TRUE_INTERCEPT,
    noise_std=NOISE_STD,
    random_state=42
)

print(f"Generated {N_SAMPLES} data points")
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"\nTrue parameters:")
print(f"  Slope (w): {TRUE_SLOPE}")
print(f"  Intercept (b): {TRUE_INTERCEPT}")

In [None]:
# Visualize the data
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.6, s=50, color='blue', label='Data Points')
plt.xlabel('X', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Synthetic Linear Data with Noise', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

print("\nData Statistics:")
print(f"X - Mean: {X.mean():.2f}, Std: {X.std():.2f}, Range: [{X.min():.2f}, {X.max():.2f}]")
print(f"y - Mean: {y.mean():.2f}, Std: {y.std():.2f}, Range: [{y.min():.2f}, {y.max():.2f}]")

---
## Part 3: Implement Linear Regression from Scratch

In [None]:
class LinearRegressionFromScratch:
    """
    Linear Regression implemented from first principles using gradient descent
    """
    
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        """
        Initialize the model
        
        Args:
            learning_rate: Step size for gradient descent
            n_iterations: Number of training iterations
        """
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weight = None
        self.bias = None
        self.loss_history = []
        self.weight_history = []
        self.bias_history = []
    
    def _compute_loss(self, y_true, y_pred):
        """
        Compute Mean Squared Error loss
        
        Args:
            y_true: True values
            y_pred: Predicted values
        
        Returns:
            MSE loss value
        """
        n = len(y_true)
        loss = (1 / n) * np.sum((y_true - y_pred) ** 2)
        return loss
    
    def _compute_gradients(self, X, y_true, y_pred):
        """
        Compute gradients of loss with respect to weight and bias
        
        Args:
            X: Input features
            y_true: True values
            y_pred: Predicted values
        
        Returns:
            dw: Gradient with respect to weight
            db: Gradient with respect to bias
        """
        n = len(y_true)
        
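        # Compute gradients
        # NOTE: X and y_true must have the same shape. Mixing an (n, 1) X
        # with an (n,) y would silently broadcast to (n, n) and give wrong sums.
        error = y_true - y_pred
        dw = -(2 / n) * np.sum(X * error)
        db = -(2 / n) * np.sum(error)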
        
        return dw, db
    
    def predict(self, X):
        """
        Make predictions using current parameters
        
        Args:
            X: Input features
        
        Returns:
            Predictions
        """
        return self.weight * X + self.bias
    
    def fit(self, X, y, verbose=True):
        """
        Train the model using gradient descent
        
        Args:
            X: Training features
            y: Training targets
            verbose: Whether to print progress
        """
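        # Reset histories so that calling fit() again starts from a clean slate
        self.loss_history = []
        self.weight_history = []
        self.bias_history = []
        
        # Initialize parameters randomly
        self.weight = np.random.randn()
        self.bias = np.random.randn()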
        
        if verbose:
            print(f"Initial parameters: w={self.weight:.4f}, b={self.bias:.4f}")
        
        # Gradient descent loop
        for iteration in range(self.n_iterations):
            # Forward pass: compute predictions
            y_pred = self.predict(X)
            
            # Compute loss
            loss = self._compute_loss(y, y_pred)
            self.loss_history.append(loss)
            
            # Compute gradients
            dw, db = self._compute_gradients(X, y, y_pred)
            
            # Update parameters
            self.weight -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
            # Store parameter history
            self.weight_history.append(self.weight)
            self.bias_history.append(self.bias)
            
            # Print progress
            if verbose and (iteration % 100 == 0 or iteration == self.n_iterations - 1):
                print(f"Iteration {iteration:4d}: Loss={loss:.4f}, w={self.weight:.4f}, b={self.bias:.4f}")
        
        if verbose:
            print(f"\nTraining complete!")
            print(f"Final parameters: w={self.weight:.4f}, b={self.bias:.4f}")

print("LinearRegressionFromScratch class defined!")

---
## Part 4: Train the Model

In [None]:
# Create and train the model
LEARNING_RATE = 0.1
N_ITERATIONS = 1000

model = LinearRegressionFromScratch(
    learning_rate=LEARNING_RATE,
    n_iterations=N_ITERATIONS
)

print(f"Training Linear Regression with:")
print(f"  Learning Rate: {LEARNING_RATE}")
print(f"  Iterations: {N_ITERATIONS}")
print(f"\n{'='*60}\n")

model.fit(X, y, verbose=True)

In [None]:
# Compare learned parameters with true parameters
print("\n" + "="*60)
print("PARAMETER COMPARISON")
print("="*60)
print(f"\nSlope (weight):")
print(f"  True value:    {TRUE_SLOPE:.4f}")
print(f"  Learned value: {model.weight:.4f}")
print(f"  Error:         {abs(TRUE_SLOPE - model.weight):.4f}")

print(f"\nIntercept (bias):")
print(f"  True value:    {TRUE_INTERCEPT:.4f}")
print(f"  Learned value: {model.bias:.4f}")
print(f"  Error:         {abs(TRUE_INTERCEPT - model.bias):.4f}")

# Calculate final loss
final_predictions = model.predict(X)
final_loss = model._compute_loss(y, final_predictions)
print(f"\nFinal Loss (MSE): {final_loss:.4f}")

---
## Part 5: Visualize Training Process

In [None]:
# Plot loss curve
plot_loss_curve(
    model.loss_history,
    title="Training Loss Over Iterations",
    xlabel="Iteration",
    ylabel="Mean Squared Error"
)

print(f"Initial Loss: {model.loss_history[0]:.4f}")
print(f"Final Loss: {model.loss_history[-1]:.4f}")
print(f"Loss Reduction: {(1 - model.loss_history[-1]/model.loss_history[0]) * 100:.2f}%")

In [None]:
# Plot learned regression line
predictions = model.predict(X)

plot_regression_line(
    X, y, predictions,
    title=f"Linear Regression Fit (w={model.weight:.2f}, b={model.bias:.2f})"
)

In [None]:
# Plot parameter trajectory (how parameters changed during training)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Weight trajectory
ax1.plot(model.weight_history, linewidth=2, color='blue')
ax1.axhline(y=TRUE_SLOPE, color='red', linestyle='--', linewidth=2, label=f'True value ({TRUE_SLOPE})')
ax1.set_xlabel('Iteration', fontsize=12)
ax1.set_ylabel('Weight Value', fontsize=12)
ax1.set_title('Weight Trajectory During Training', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.legend()

# Bias trajectory
ax2.plot(model.bias_history, linewidth=2, color='green')
ax2.axhline(y=TRUE_INTERCEPT, color='red', linestyle='--', linewidth=2, label=f'True value ({TRUE_INTERCEPT})')
ax2.set_xlabel('Iteration', fontsize=12)
ax2.set_ylabel('Bias Value', fontsize=12)
ax2.set_title('Bias Trajectory During Training', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.legend()

plt.tight_layout()
plt.show()

print("Observe how the parameters converge toward the true values (noise in the data keeps them from matching exactly).")

---
## Part 6: Experiment with Different Learning Rates

**Key Question:** How does the learning rate affect convergence?
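
Before running the full experiment, the effect can be seen on the simplest possible problem. The sketch below is an illustration only, separate from the regression model: it runs gradient descent on $f(w) = w^2$, where each step multiplies $w$ by $(1 - 2\alpha)$. The iterates shrink only when $|1 - 2\alpha| < 1$, that is, $0 < \alpha < 1$. The function `run_gd` and the chosen rates are hypothetical, and the exact threshold for the regression problem depends on the data.

```python
def run_gd(alpha, w0=1.0, steps=20):
    """Gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2.0 * w  # equivalent to w *= (1 - 2*alpha)
    return w

# Too small, well chosen, oscillating but stable, and divergent
for alpha in (0.01, 0.4, 0.9, 1.1):
    print(f"alpha={alpha}: w after 20 steps = {run_gd(alpha):.4g}")
```

With a very small rate, $w$ has barely moved toward 0 after 20 steps. A moderate rate reaches 0 almost immediately, a rate near the limit flips sign every step while shrinking, and a rate above the limit grows without bound.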

In [None]:
# Test different learning rates
learning_rates = [0.001, 0.01, 0.1, 0.5]
models = {}

print("Training models with different learning rates...\n")

for lr in learning_rates:
    print(f"Learning Rate: {lr}")
    print("-" * 40)
    
    model_temp = LinearRegressionFromScratch(
        learning_rate=lr,
        n_iterations=1000
    )
    model_temp.fit(X, y, verbose=False)
    
    models[lr] = model_temp
    
    print(f"  Final Loss: {model_temp.loss_history[-1]:.4f}")
    print(f"  Final w: {model_temp.weight:.4f}")
    print(f"  Final b: {model_temp.bias:.4f}")
    print()

In [None]:
# Compare loss curves for different learning rates
from utils.visualization import plot_learning_rate_comparison

losses_dict = {lr: m.loss_history for lr, m in models.items()}

plot_learning_rate_comparison(learning_rates, losses_dict)

In [None]:
# Detailed comparison
print("\n" + "="*70)
print("LEARNING RATE EFFECTS ANALYSIS")
print("="*70)

for lr in learning_rates:
    model_temp = models[lr]
    initial_loss = model_temp.loss_history[0]
    final_loss = model_temp.loss_history[-1]
    
    print(f"\nLearning Rate: {lr}")
    print(f"  Initial Loss:    {initial_loss:.4f}")
    print(f"  Final Loss:      {final_loss:.4f}")
    print(f"  Loss Reduction:  {(1 - final_loss/initial_loss) * 100:.2f}%")
    print(f"  Weight Error:    {abs(TRUE_SLOPE - model_temp.weight):.4f}")
    print(f"  Bias Error:      {abs(TRUE_INTERCEPT - model_temp.bias):.4f}")
    
    # Check convergence
    if len(model_temp.loss_history) > 100:
        last_100_change = abs(model_temp.loss_history[-1] - model_temp.loss_history[-100])
        if last_100_change < 0.001:
            print(f"  Status: ✓ Converged")
        else:
            print(f"  Status: ⚠ Still changing (Δ={last_100_change:.4f})")

---
## Part 7: Compare with sklearn Implementation

In [None]:
# Compare with sklearn's LinearRegression
from sklearn.linear_model import LinearRegression as SklearnLR

sklearn_model = SklearnLR()
sklearn_model.fit(X, y)

print("\n" + "="*70)
print("COMPARISON: Our Implementation vs sklearn")
print("="*70)

print(f"\nSlope (weight):")
print(f"  Our model:     {model.weight:.6f}")
print(f"  sklearn:       {sklearn_model.coef_[0][0]:.6f}")
print(f"  Difference:    {abs(model.weight - sklearn_model.coef_[0][0]):.6f}")

print(f"\nIntercept (bias):")
print(f"  Our model:     {model.bias:.6f}")
print(f"  sklearn:       {sklearn_model.intercept_[0]:.6f}")
print(f"  Difference:    {abs(model.bias - sklearn_model.intercept_[0]):.6f}")

# Compare predictions
our_predictions = model.predict(X)
sklearn_predictions = sklearn_model.predict(X)

prediction_diff = np.mean(np.abs(our_predictions - sklearn_predictions))
print(f"\nMean Absolute Difference in Predictions: {prediction_diff:.6f}")

if prediction_diff < 0.01:
    print("\n✓ Our implementation matches sklearn very closely!")
else:
    print("\n⚠ Some difference exists (expected: sklearn solves least squares directly, while gradient descent only approaches that solution)")

---
## Part 8: Key Learnings and Reflections

### What I Learned

1. **Loss Function (MSE)**
   - Measures prediction error
   - Differentiable (needed for gradient descent)
   - Penalizes large errors more

2. **Gradient Descent**
   - Iteratively updates parameters
   - Moves in direction of steepest descent
   - Requires tuning learning rate

3. **Learning Rate Effects**
   - Too small: slow convergence
   - Too large: oscillation or divergence
   - Sweet spot: steady decrease in loss

4. **Convergence**
   - Loss plateaus when parameters are near optimal
   - Can monitor gradient magnitude
   - Early stopping possible
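
The convergence points above can be turned into a stopping rule. The following is a minimal sketch, not the class used in this notebook: it stops once the gradient norm falls below a tolerance instead of always running a fixed number of iterations. The function name `fit_with_early_stopping`, the tolerance, and the synthetic data are assumptions for illustration. `np.polyfit` is used only as a reference least-squares fit.

```python
import numpy as np

def fit_with_early_stopping(x, y, alpha=0.1, max_iters=10_000, tol=1e-6):
    """Gradient descent on MSE that stops when the gradient norm is below tol."""
    w, b = 0.0, 0.0
    for i in range(max_iters):
        error = y - (w * x + b)
        dw = -2.0 * np.mean(x * error)
        db = -2.0 * np.mean(error)
        # A tiny gradient means the loss surface is flat here: we are near the minimum
        if np.hypot(dw, db) < tol:
            return w, b, i
        w -= alpha * dw
        b -= alpha * db
    return w, b, max_iters

# Synthetic data, roughly y = 4 + 3x + noise (illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
y = 4.0 + 3.0 * x + rng.normal(0, 1.0, size=100)

w, b, n_iters = fit_with_early_stopping(x, y)
w_ls, b_ls = np.polyfit(x, y, deg=1)  # reference: direct least-squares line fit

print(f"Stopped after {n_iters} iterations: w={w:.4f}, b={b:.4f}")
print(f"Least-squares reference:        w={w_ls:.4f}, b={b_ls:.4f}")
```

Monitoring the change in loss between iterations would work similarly. The gradient norm is used here because it does not depend on the scale of the loss itself.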

### Connections to Future Projects

- **Same gradient descent** idea, usually in stochastic or adaptive variants, is used to train neural networks and transformers
- **Same loss monitoring** applies to other ML training loops
- **Learning rate tuning** is critical in deep learning
- **Parameter updates** are the core of gradient-based optimization

---
## Part 9: Next Steps

### Experiments to Try
- [ ] Add more features (multi-dimensional regression)
- [ ] Implement mini-batch gradient descent
- [ ] Add momentum to gradient descent
- [ ] Try adaptive learning rates
- [ ] Test on real datasets

### Move to Project 2
Next: **Binary Classification with Logistic Regression**
- Different loss function (cross-entropy)
- Sigmoid activation
- Classification metrics