<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Linear%20Regression/Linear%20Regression%20Hands-On%20Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression Hands-On Lab

In this lab, you will implement a Linear Regression model from scratch, understand the mathematics behind it, and apply it to real data. Along the way, you'll answer conceptual questions and create visualizations to deepen your understanding.

**Learning Objectives:**
- Understand the mathematics of Linear Regression
- Implement a custom Linear Regression class from scratch
- Visualize regression lines and prediction errors
- Apply feature scaling and understand its importance
- Tune models using train/validation/test splits
- Compare custom implementation with scikit-learn
- Analyze model performance and errors

## Overview of Linear Regression

Linear Regression is a **parametric supervised learning algorithm** used for **regression tasks** (predicting continuous values).

**Key Idea:**
- Find a **linear function** that best fits the training data
- Model: **y = w₁x + w₀** (for single feature) or **y = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ** (for multiple features)
- **w₀** is the intercept (bias term)
- **w₁, w₂, ..., wₙ** are the slopes (weights)

**How it works:**
1. **Training:** Find optimal weights **w** using the **normal equation**: **w = (ΦᵀΦ)⁻¹Φᵀy**
2. **Prediction:** Compute **ŷ = Φw** where Φ is the design matrix with bias column
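
The normal equation follows from minimizing the sum of squared errors. A short derivation sketch, assuming ΦᵀΦ is invertible (the loss symbol J is introduced here only for illustration):

$$
\begin{aligned}
J(\mathbf{w}) &= \lVert \mathbf{y} - \Phi \mathbf{w} \rVert^2 = (\mathbf{y} - \Phi \mathbf{w})^\top (\mathbf{y} - \Phi \mathbf{w}) \\
\nabla_{\mathbf{w}} J &= -2\,\Phi^\top (\mathbf{y} - \Phi \mathbf{w}) = \mathbf{0} \\
\Phi^\top \Phi\,\mathbf{w} &= \Phi^\top \mathbf{y} \\
\mathbf{w} &= (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{y}
\end{aligned}
$$

Because J is a convex quadratic in **w**, the point where the gradient vanishes is the global minimum.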

**Advantages:**
- Simple and interpretable
- Fast training (closed-form solution)
- Fast prediction
- Works well when relationships are linear

**Disadvantages:**
- Assumes linear relationship
- Sensitive to outliers
- Can't capture complex non-linear patterns (without feature engineering)
- Solving the normal equation can be computationally expensive when the number of features is very large (forming ΦᵀΦ costs O(Nd²) and inverting it costs O(d³))

> **Question**: Linear Regression finds the best fit line by:
>
> A. Computing the closed-form solution using the normal equation w = (ΦᵀΦ)⁻¹Φᵀy
>
> B. Iteratively searching through all possible lines
>
> C. Using the K nearest neighbors to draw a line
>
> D. Randomly selecting weights until a good fit is found

<details><summary>Click to reveal answer</summary>

**Answer: A**

Linear Regression uses the normal equation to compute the optimal weights directly in one step, without any iterative training.
</details>

## Model Complexity: From Simple to Complex

While basic Linear Regression fits a straight line, we can create more complex models using **polynomial features**.

**Examples:**
- **Degree 1 (Linear):** y = w₀ + w₁x
- **Degree 2 (Quadratic):** y = w₀ + w₁x + w₂x²
- **Degree 3 (Cubic):** y = w₀ + w₁x + w₂x² + w₃x³

As model complexity increases:
- **Low complexity (underfit):** Model is too simple, high bias, misses patterns
- **Right complexity:** Model generalizes well, balanced bias-variance
- **High complexity (overfit):** Model is too complex, high variance, memorizes noise

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Generate sample data
np.random.seed(42)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel()**2 + X.ravel() + 2 + np.random.normal(0, 1, 50)

# Create figure with subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
degrees = [1, 2, 9]
titles = ['Underfitting (Degree 1)', 'Good Fit (Degree 2)', 'Overfitting (Degree 9)']

for ax, degree, title in zip(axes, degrees, titles):
    # Fit polynomial regression
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly = poly.fit_transform(X)
    model = LinearRegression()
    model.fit(X_poly, y)
    
    # Plot
    X_plot = np.linspace(-3, 3, 200).reshape(-1, 1)
    X_plot_poly = poly.transform(X_plot)
    y_plot = model.predict(X_plot_poly)
    
    ax.scatter(X, y, c='lightblue', s=50, edgecolors='black', label='Data')
    ax.plot(X_plot, y_plot, 'r-', linewidth=2, label=f'Degree {degree}')
    ax.set_xlabel('x', fontsize=12)
    ax.set_ylabel('y', fontsize=12)
    ax.set_title(title, fontsize=14)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Bias-Variance Trade-off in Linear Regression

**Bias** = Error from overly simplistic assumptions
- High bias → Underfitting → Model too simple
- Example: Fitting a line to quadratic data

**Variance** = Error from sensitivity to small fluctuations in training data
- High variance → Overfitting → Model too complex
- Example: High-degree polynomial that memorizes noise

**Goal:** Find the sweet spot that minimizes **total error = bias² + variance + irreducible error**
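
One way to see the trade-off numerically is to compare training and validation error across polynomial degrees. The sketch below is illustrative only: the synthetic quadratic data, the 70/30 split, and the `_demo` variable names are assumptions, not part of the lab's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data with noise (assumed for illustration)
rng = np.random.default_rng(42)
X_demo = np.linspace(-3, 3, 60).reshape(-1, 1)
y_demo = 0.5 * X_demo.ravel() ** 2 + X_demo.ravel() + 2 + rng.normal(0, 1, 60)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

errors = {}  # degree -> (training MSE, validation MSE)
for degree in (1, 2, 9):
    pipe = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    pipe.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, pipe.predict(X_tr))
    val_mse = mean_squared_error(y_va, pipe.predict(X_va))
    errors[degree] = (train_mse, val_mse)
    print(f"degree={degree}: train MSE={train_mse:.3f}, validation MSE={val_mse:.3f}")
```

Training error can only go down as the degree increases, because each higher-degree model contains the lower-degree ones. Validation error is the quantity to watch: it typically falls and then rises again once the model starts fitting noise.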

> **Question**: A linear regression model with polynomial degree 1 has high training and validation error. This indicates:
>
> A. High bias (underfitting) - the model is too simple
>
> B. High variance (overfitting) - the model is too complex
>
> C. Perfect fit - the model generalizes well
>
> D. The data has too much noise

<details><summary>Click to reveal answer</summary>

**Answer: A**

When both training and validation errors are high, the model is underfitting - it's too simple to capture the underlying pattern.
</details>

## Feature Scaling in Linear Regression

While Linear Regression's predictions aren't affected by feature scaling (unlike KNN), scaling is still important because:

1. **Numerical Stability:** Computing (ΦᵀΦ)⁻¹ can be numerically unstable when features have very different scales
2. **Regularization:** If using Ridge or Lasso regression, unscaled features will be penalized unfairly
3. **Optimization:** When using gradient descent instead of the normal equation, convergence is much faster with scaled features
4. **Interpretability:** Standardized coefficients can be compared to determine feature importance

**Best Practice:** Always scale features for regression tasks!
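
The numerical-stability point can be made concrete by comparing the condition number of ΦᵀΦ before and after standardizing. This is a minimal sketch on made-up data: the two features, their ranges, and the helper `gram_condition_number` are hypothetical and chosen only to mimic features on very different scales.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Two hypothetical features on very different scales (illustrative values only)
income = rng.uniform(0.5, 15, n_samples)
population = rng.uniform(3, 35_000, n_samples)
X_raw = np.column_stack([income, population])


def gram_condition_number(X):
    """Condition number of Phi^T Phi, where Phi is X with a leading column of ones."""
    Phi = np.c_[np.ones(len(X)), X]
    return np.linalg.cond(Phi.T @ Phi)


# Standardize each column to mean 0 and standard deviation 1
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

cond_raw = gram_condition_number(X_raw)
cond_std = gram_condition_number(X_std)
print(f"Condition number, raw features:          {cond_raw:.3e}")
print(f"Condition number, standardized features: {cond_std:.3e}")
```

A large condition number means small rounding errors can be amplified when the matrix is inverted, so the much smaller value after standardizing is what "numerically stable" refers to here.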

## Z-Score Standardization

### The Z-Score Formula

$$z = \frac{x - \mu}{\sigma}$$

Where:
- **x** = original value
- **μ** = mean of the feature
- **σ** = standard deviation of the feature
- **z** = standardized value

### What This Does
- Centers data around 0 (mean = 0)
- Scales to unit variance (std = 1)
- Preserves the distribution shape
- Makes features comparable
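
As a quick check of these properties, the formula can be applied by hand with NumPy and compared against `StandardScaler`. The small feature matrix `X_demo` below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 5 samples, 2 features on different scales
X_demo = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
    [4.0, 700.0],
    [5.0, 900.0],
])

mu = X_demo.mean(axis=0)     # per-feature mean
sigma = X_demo.std(axis=0)   # per-feature population std (ddof=0)
Z = (X_demo - mu) / sigma    # z-score applied column by column

print("Column means after scaling:", Z.mean(axis=0))
print("Column stds after scaling: ", Z.std(axis=0))

# StandardScaler also uses the population standard deviation, so results agree
Z_sklearn = StandardScaler().fit_transform(X_demo)
print("Matches StandardScaler:", np.allclose(Z, Z_sklearn))
```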

### Critical Rule for Train/Validation/Test Splits
```python
from sklearn.preprocessing import StandardScaler

# 1. Fit scaler on training data ONLY
scaler = StandardScaler()
scaler.fit(X_train)  # Compute μ and σ from training data

# 2. Transform all sets using SAME parameters
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)      # Use training μ and σ
X_test_scaled = scaler.transform(X_test)    # Use training μ and σ
```

**Why?** To prevent **data leakage** - the model should not have any information about validation/test sets during training!

> **Question**: When using StandardScaler with train/validation/test splits, what is the correct approach?
>
> A. Fit the scaler on training data only, then transform all sets using those parameters
>
> B. Fit and transform each set independently
>
> C. Fit the scaler on all data first, then split
>
> D. Scaling is not needed for Linear Regression

<details><summary>Click to reveal answer</summary>

**Answer: A**

To avoid data leakage, always fit the scaler on training data only, then use those same parameters (mean and std) to transform validation and test sets.
</details>

## Pseudocode for Linear Regression

### Formal Pseudocode

```
============================================
Inputs
============================================
X       ← training features (N × d matrix)
y       ← training targets (N × 1 vector)
X_query ← examples to predict

============================================
----- fit -----
============================================
1. Add bias column: Φ ← [1, X]  # (N × (d+1))
2. Compute weights using normal equation:
   w ← (ΦᵀΦ)⁻¹Φᵀy
3. Store w (X and y are no longer needed after fitting)

============================================
----- predict -----
============================================
For each query point in X_query:
1. Add bias: Φ_query ← [1, X_query]
2. Compute prediction: ŷ ← Φ_query · w
3. Return ŷ
```

### Key Observations
- **No iterations needed:** One-step solution via normal equation
- **Fast prediction:** Just matrix multiplication
- **Memory efficient:** Only stores weights (not all training data like KNN)
- **Global model:** Learns one function for entire space (unlike local KNN)
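
In practice, the explicit inverse is often replaced by a least-squares solver. The sketch below compares the normal equation with `np.linalg.lstsq` on synthetic data. It is not the lab's required implementation, and the data and the `true_w` values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with known weights [bias, w1, w2, w3] (illustrative values)
X_demo = rng.normal(size=(100, 3))
true_w = np.array([4.0, 1.5, -2.0, 0.5])
Phi = np.c_[np.ones(len(X_demo)), X_demo]          # design matrix with bias column
y_demo = Phi @ true_w + rng.normal(0, 0.1, 100)    # targets with small noise

# Normal equation with an explicit inverse
w_inv = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ y_demo

# Least-squares solver: same minimizer, no explicit inverse formed
w_lstsq, *_ = np.linalg.lstsq(Phi, y_demo, rcond=None)

print("Normal equation:", np.round(w_inv, 3))
print("lstsq:          ", np.round(w_lstsq, 3))
```

On well-conditioned data the two agree to numerical precision. The solver-based route tends to behave better when ΦᵀΦ is close to singular.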

## Implementing a Custom Linear Regression Class

Below is a scaffold of the `MyLinearRegressor` class. Fill in the TODO sections to complete the implementation:

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MyLinearRegressor(BaseEstimator, RegressorMixin):
    """
    Custom Linear Regression implementation using the normal equation.
    
    Parameters:
    -----------
    None
    
    Attributes:
    -----------
    weights_ : array of shape (n_features + 1,)
        Learned weights including bias term
    """
    
    def __init__(self):
        pass
    
    def fit(self, X, y):
        """
        Fit the linear regression model using the normal equation.
        
        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            Training data
        y : array-like of shape (n_samples,)
            Target values
        
        Returns:
        --------
        self
        """
        # TODO: Add column of ones for bias term
        # Hint: Use np.c_[np.ones(len(X)), X] to create design matrix Phi
        Phi = None  # Replace with your code
        
        # TODO: Compute weights using normal equation: w = (Phi^T Phi)^{-1} Phi^T y
        # Hint: Use @ for matrix multiplication, .T for transpose, np.linalg.inv() for inverse
        self.weights_ = None  # Replace with your code
        
        return self
    
    def predict(self, X):
        """
        Predict using the linear model.
        
        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            Samples to predict
        
        Returns:
        --------
        y_pred : array of shape (n_samples,)
            Predicted values
        """
        # TODO: Add column of ones for bias term
        Phi = None  # Replace with your code
        
        # TODO: Compute predictions: y_pred = Phi @ weights
        y_pred = None  # Replace with your code
        
        return y_pred

### Test Your Implementation

Once you have filled in the implementation, test the custom regressor on a simple dataset to check that it works as expected.

In [None]:
# Create simple test data
np.random.seed(42)
X_simple = np.array([[1], [2], [3], [4], [5]])
y_simple = np.array([2, 4, 6, 8, 10])  # Perfect linear relationship: y = 2x

# Fit model
model = MyLinearRegressor()
model.fit(X_simple, y_simple)

# Make predictions
predictions = model.predict(X_simple)

print("Learned weights (w0=intercept, w1=slope):", model.weights_)
print("Expected: [0, 2] or very close to it")
print("\nPredictions:", predictions)
print("Actual:     ", y_simple)
print("\nMean Squared Error:", np.mean((predictions - y_simple)**2))
print("Expected: ~0 (perfect fit)")

> **Question**: Unlike algorithms such as KNN, Linear Regression has very fast prediction time even with large datasets. Why?
>
> A. Linear Regression learns a parametric model (weights) during training. Prediction is just a simple matrix multiplication, regardless of training set size.
>
> B. Linear Regression stores fewer training examples than KNN.
>
> C. Linear Regression uses a faster distance metric.
>
> D. Linear Regression approximates predictions rather than computing exact values.

<details><summary>Click to reveal answer</summary>

**Answer: A**

Linear Regression learns fixed weights during training. At prediction time, it only needs to compute ŷ = Φw, which costs O(d) per query point, where d is the number of features. KNN must compute distances to all N training points for every query, which costs O(Nd).
</details>

## A Dataset for Visualization

Let's work with the same synthetic dataset from the Code Walk Through to visualize how Linear Regression works.

In [None]:
# Generate the same data as in Code Walk Through
np.random.seed(42)
X_train = np.arange(-9.5, 8.5, 0.1).reshape(-1, 1)
y_train = X_train.ravel() + 1 + np.random.normal(0, 2, len(X_train))

print(f"Training data: {len(X_train)} points")
print(f"X range: [{X_train.min():.1f}, {X_train.max():.1f}]")
print(f"y range: [{y_train.min():.1f}, {y_train.max():.1f}]")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, c='lightblue', alpha=0.6, edgecolors='black', linewidths=0.5)
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Training Data: Linear Relationship with Noise', fontsize=16)
plt.grid(True, alpha=0.3)
plt.show()

## Training and Visualizing the Model

In [None]:
# TODO: Fit your MyLinearRegressor on the training data
# Hint: model = MyLinearRegressor()
#       model.fit(X_train, y_train)

# Your code here


print(f"Learned weights: {model.weights_}")
print(f"Model equation: y = {model.weights_[1]:.3f}x + {model.weights_[0]:.3f}")

In [None]:
# Visualize the fit
x_line = np.linspace(X_train.min(), X_train.max(), 100).reshape(-1, 1)
y_line = model.predict(x_line)

plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, c='lightblue', alpha=0.6, edgecolors='black', linewidths=0.5, label='Training data')
plt.plot(x_line, y_line, 'r-', linewidth=2, label=f'Best fit: y={model.weights_[1]:.2f}x+{model.weights_[0]:.2f}')
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Linear Regression: Best Fit Line', fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

## Understanding Predictions: Step-by-Step

Let's visualize how the model makes a prediction for a single test point.

In [None]:
# TODO: Make a prediction for x = 5.0
# Hint: X_test = np.array([[5.0]])
#       y_pred = model.predict(X_test)

# Your code here


print(f"For x = {X_test[0, 0]:.1f}:")
print(f"Predicted y = {y_pred[0]:.3f}")
print(f"Calculation: y = {model.weights_[1]:.3f} × {X_test[0, 0]:.1f} + {model.weights_[0]:.3f} = {y_pred[0]:.3f}")

In [None]:
# Visualize the prediction
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, c='lightblue', alpha=0.6, label='Training data')
plt.plot(x_line, y_line, 'k-', linewidth=2, label='Best fit line')
plt.scatter(X_test, y_pred, c='red', s=200, marker='*', edgecolors='black', linewidths=2, 
           label=f'Prediction: x={X_test[0,0]:.1f}, ŷ={y_pred[0]:.2f}', zorder=5)
plt.plot([X_train.min(), X_test[0,0]], [y_pred[0], y_pred[0]], 'r--', alpha=0.5, linewidth=1)
plt.plot([X_test[0,0], X_test[0,0]], [y_train.min(), y_pred[0]], 'r--', alpha=0.5, linewidth=1)
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Linear Regression: Making a Prediction', fontsize=16)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

## Regression Metrics: MSE and RMSE

### Mean Squared Error (MSE)

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

- Measures average squared difference between actual and predicted values
- Units are squared (e.g., if y is in dollars, MSE is in dollars²)
- Heavily penalizes large errors (due to squaring)

### Root Mean Squared Error (RMSE)

$$\text{RMSE} = \sqrt{\text{MSE}}$$

- Same units as the target variable
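- Easier to interpret (e.g., "a typical error of about $5,000"), although it weights large errors more heavily than a plain average of absolute errors would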
- Commonly used for regression evaluation

### R² Score (Coefficient of Determination)

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

- Ranges from -∞ to 1 (1 is perfect, 0 means model is no better than predicting mean)
- Represents proportion of variance explained by the model
- Unitless, so it does not depend on the scale of the target (comparisons across different datasets should still be made with care)
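
The three formulas can be checked by hand on a tiny example and compared with scikit-learn's implementations. The arrays `y_true` and `y_hat` below are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

# MSE: mean of squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# RMSE: square root of MSE, same units as y
rmse_manual = np.sqrt(mse_manual)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE:  {mse_manual:.4f}")   # 0.3750
print(f"RMSE: {rmse_manual:.4f}")  # 0.6124
print(f"R^2:  {r2_manual:.4f}")    # 0.9250

# Cross-check against scikit-learn
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```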

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Compute metrics on training data
y_train_pred = model.predict(X_train)
mse = mean_squared_error(y_train, y_train_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_train, y_train_pred)

print(f"Training Metrics:")
print(f"  MSE:  {mse:.3f}")
print(f"  RMSE: {rmse:.3f}")
print(f"  R²:   {r2:.3f}")
print(f"\nInterpretation: The typical prediction error (RMSE) is {rmse:.2f} units.")

## Working with a Real Dataset: California Housing

Now let's apply Linear Regression to a real-world dataset. We'll use the California Housing dataset, which contains information about California districts and median house values.

**Dataset Features:**
- MedInc: Median income in block group
- HouseAge: Median house age in block group  
- AveRooms: Average number of rooms per household
- AveBedrms: Average number of bedrooms per household
- Population: Block group population
- AveOccup: Average number of household members
- Latitude: Block group latitude
- Longitude: Block group longitude

**Target:** Median house value (in $100,000s)

In [None]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Create DataFrame for better visualization
df = pd.DataFrame(X, columns=housing.feature_names)
df['MedHouseVal'] = y

print(f"Dataset shape: {X.shape}")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"\nFirst few rows:")
print(df.head())
print(f"\nDataset statistics:")
print(df.describe())

Notice how features have **very different scales:**
- MedInc ranges from ~0.5 to ~15
- Population ranges from ~3 to ~35,000+
- Latitude/Longitude are coordinates

This is why feature scaling is important!

## Splitting into Train, Validation, and Test Sets

**Why 3 splits?**
- **Training set (60%):** Fit the model
- **Validation set (20%):** Tune hyperparameters / compare models
- **Test set (20%):** Final evaluation (touch only once!)

**Important:** Test set simulates real-world unseen data. Never use it for model selection!

In [None]:
from sklearn.model_selection import train_test_split

# TODO: Split into Train (60%), Validation (20%), Test (20%)
# Hint: First split into Train (60%) and Temp (40%)
#       Then split Temp into equal halves for Validation and Test
#       Use random_state=42 for reproducibility

# Your code here


print(f"Training set:   {len(X_train)} samples ({len(X_train)/len(X)*100:.0f}%)")
print(f"Validation set: {len(X_val)} samples ({len(X_val)/len(X)*100:.0f}%)")
print(f"Test set:       {len(X_test)} samples ({len(X_test)/len(X)*100:.0f}%)")

## Feature Scaling: Comparing Unscaled vs Scaled

### 1. Linear Regression on Unscaled Features

In [None]:
# TODO: Fit MyLinearRegressor on unscaled training data
# TODO: Predict on validation set
# TODO: Compute RMSE and R² scores

# Your code here


print("Performance on UNSCALED features:")
print(f"  Validation RMSE: {rmse_unscaled:.4f}")
print(f"  Validation R²:   {r2_unscaled:.4f}")

### 2. Linear Regression on Scaled Features

In [None]:
from sklearn.preprocessing import StandardScaler

# TODO: Create and fit StandardScaler on training data ONLY
# TODO: Transform train and validation sets
# Hint: scaler = StandardScaler()
#       scaler.fit(X_train)
#       X_train_scaled = scaler.transform(X_train)
#       X_val_scaled = scaler.transform(X_val)

# Your code here


print("Scaled training data statistics:")
print(f"  Mean: {X_train_scaled.mean(axis=0)}")
print(f"  Std:  {X_train_scaled.std(axis=0)}")

In [None]:
# TODO: Fit MyLinearRegressor on scaled training data
# TODO: Predict on scaled validation set
# TODO: Compute RMSE and R² scores

# Your code here


print("Performance on SCALED features:")
print(f"  Validation RMSE: {rmse_scaled:.4f}")
print(f"  Validation R²:   {r2_scaled:.4f}")

### Comparison

In [None]:
print("\n" + "="*50)
print("COMPARISON: Unscaled vs Scaled Features")
print("="*50)
print(f"Unscaled - RMSE: {rmse_unscaled:.4f} | R²: {r2_unscaled:.4f}")
print(f"Scaled   - RMSE: {rmse_scaled:.4f} | R²: {r2_scaled:.4f}")
print("\nNote: For Linear Regression with normal equation, scaling doesn't change predictions.")
print("However, it improves numerical stability and is essential for regularization!")

> **Question**: You observe that a Linear Regression model has RMSE of 0.5 on training data but 0.75 on validation data. What does this suggest?
>
> A. The model generalizes reasonably well - some gap between train and validation error is expected due to overfitting
>
> B. The model is severely underfitting
>
> C. There is data leakage from validation to training
>
> D. Feature scaling is needed

<details><summary>Click to reveal answer</summary>

**Answer: A**

A moderate gap between training and validation error is normal - the model fits the training data slightly better than unseen data. The gap indicates some overfitting but not severe. If the gap were very large (e.g., 0.1 vs 2.0), that would indicate serious overfitting.
</details>

## Comparing with Scikit-Learn's Implementation

Let's verify our implementation matches scikit-learn's LinearRegression.

In [None]:
from sklearn.linear_model import LinearRegression as SklearnLR

# Fit sklearn model
sklearn_model = SklearnLR()
sklearn_model.fit(X_train_scaled, y_train)
y_val_pred_sklearn = sklearn_model.predict(X_val_scaled)

# Compute metrics
rmse_sklearn = np.sqrt(mean_squared_error(y_val, y_val_pred_sklearn))
r2_sklearn = r2_score(y_val, y_val_pred_sklearn)

print("\n" + "="*60)
print("COMPARISON: Custom Implementation vs Scikit-Learn")
print("="*60)
print(f"Custom MyLinearRegressor - RMSE: {rmse_scaled:.4f} | R²: {r2_scaled:.4f}")
print(f"Sklearn LinearRegression - RMSE: {rmse_sklearn:.4f} | R²: {r2_sklearn:.4f}")
print(f"\nDifference in RMSE: {abs(rmse_scaled - rmse_sklearn):.6f}")
print(f"Difference in R²:   {abs(r2_scaled - r2_sklearn):.6f}")
print("\n✓ Results should match (within numerical precision)!")

## Visualizing Predictions vs Actual Values

In [None]:
# Create scatter plot of predictions vs actual
plt.figure(figsize=(10, 6))
plt.scatter(y_val, y_val_pred_scaled, alpha=0.5, edgecolors='black', linewidths=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'r--', linewidth=2, label='Perfect predictions')
plt.xlabel('Actual Values', fontsize=14)
plt.ylabel('Predicted Values', fontsize=14)
plt.title(f'Predictions vs Actual (R² = {r2_scaled:.3f})', fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

print("Points close to the red line indicate accurate predictions.")
print("Scatter away from the line shows prediction errors.")

## Understanding Prediction Errors: Residual Analysis

In [None]:
# Compute residuals (errors)
residuals = y_val - y_val_pred_scaled

# Create residual plot
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Residuals vs Predicted Values
axes[0].scatter(y_val_pred_scaled, residuals, alpha=0.5, edgecolors='black', linewidths=0.5)
axes[0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0].set_xlabel('Predicted Values', fontsize=12)
axes[0].set_ylabel('Residuals (Actual - Predicted)', fontsize=12)
axes[0].set_title('Residual Plot', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Plot 2: Histogram of Residuals
axes[1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='r', linestyle='--', linewidth=2)
axes[1].set_xlabel('Residuals', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Distribution of Residuals', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Good residual plots should show:")
print("  1. Residuals randomly scattered around 0 (no patterns)")
print("  2. Approximately normal distribution")
print("  3. Constant variance across all predicted values (homoscedasticity)")

## Impact of Outliers on Linear Regression

Linear Regression is **sensitive to outliers** because it uses the **sum of squared errors** as its objective function. Outliers can significantly distort the regression line by pulling it towards themselves, leading to:

- **Biased coefficient estimates**: The model tries to minimize error for all points, including outliers
- **Reduced model accuracy**: The regression line may not represent the true relationship for most data points
- **Poor generalization**: The model fits extreme points rather than the overall trend

**Key Principle**: Our ML model should perform well in most cases. Therefore, outliers should often be detected and handled appropriately (removed, capped, or modeled separately) before training.

Let's demonstrate the impact of outliers with a visual example:

In [None]:
# Generate clean data with a linear relationship
np.random.seed(42)
X_clean = np.linspace(0, 10, 50).reshape(-1, 1)
y_clean = 2 * X_clean.ravel() + 1 + np.random.normal(0, 1, 50)

# Add an outlier
X_with_outlier = np.vstack([X_clean, [[8.0]]])
y_with_outlier = np.append(y_clean, [5.0])  # This point is far below the trend

# Fit models
model_clean = MyLinearRegressor()
model_clean.fit(X_clean, y_clean)

model_with_outlier = MyLinearRegressor()
model_with_outlier.fit(X_with_outlier, y_with_outlier)

# Create predictions for plotting
X_plot = np.linspace(0, 10, 100).reshape(-1, 1)
y_plot_clean = model_clean.predict(X_plot)
y_plot_with_outlier = model_with_outlier.predict(X_plot)

# Visualize the impact
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Plot without outlier
ax1.scatter(X_clean, y_clean, alpha=0.6, edgecolors='black', linewidths=0.5, label='Training data')
ax1.plot(X_plot, y_plot_clean, 'r-', linewidth=2, 
         label=f'Fit: y={model_clean.weights_[1]:.2f}x+{model_clean.weights_[0]:.2f}')
ax1.set_xlabel('X', fontsize=12)
ax1.set_ylabel('y', fontsize=12)
ax1.set_title('Linear Regression WITHOUT Outlier', fontsize=14)
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Plot with outlier
ax2.scatter(X_clean, y_clean, alpha=0.6, edgecolors='black', linewidths=0.5, label='Training data')
ax2.scatter([8.0], [5.0], color='red', s=200, marker='*', edgecolors='black', 
           linewidths=2, label='Outlier', zorder=5)
ax2.plot(X_plot, y_plot_clean, 'g--', linewidth=2, alpha=0.5, label='Original fit (without outlier)')
ax2.plot(X_plot, y_plot_with_outlier, 'r-', linewidth=2, 
         label=f'Fit WITH outlier: y={model_with_outlier.weights_[1]:.2f}x+{model_with_outlier.weights_[0]:.2f}')
ax2.set_xlabel('X', fontsize=12)
ax2.set_ylabel('y', fontsize=12)
ax2.set_title('Linear Regression WITH Outlier', fontsize=14)
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nImpact of the outlier:")
print(f"  Slope changed from {model_clean.weights_[1]:.3f} to {model_with_outlier.weights_[1]:.3f}")
print(f"  Intercept changed from {model_clean.weights_[0]:.3f} to {model_with_outlier.weights_[0]:.3f}")
print(f"\nThe regression line is 'pulled' toward the outlier, affecting predictions for all points!")

> **Question**: You notice one data point in your training set has an extremely high target value compared to similar inputs. What is the likely impact on your linear regression model?
>
> A. No impact - linear regression automatically ignores outliers during training
>
> B. The regression line will be pulled toward the outlier, potentially degrading predictions for typical data points
>
> C. The model will fit better because it has more diverse training examples
>
> D. Only the intercept will be affected, not the slope

<details><summary>Click to reveal answer</summary>

**Answer: B**

Linear regression minimizes the sum of squared errors across all points. Because outliers have large errors, the model shifts the regression line to reduce their error, which can worsen predictions for the majority of normal points. This is why outlier detection and handling is crucial in linear regression.
</details>
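
One common way to reduce this sensitivity is a robust loss. The sketch below compares ordinary least squares with scikit-learn's `HuberRegressor`. This is a swapped-in technique for illustration, not part of the lab's `MyLinearRegressor`, and the synthetic data and the three outliers are assumptions.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(42)

# Clean linear data: y = 2x + 1 plus small noise (illustrative)
x_demo = np.linspace(0, 10, 50)
y_demo = 2 * x_demo + 1 + rng.normal(0, 0.5, 50)

# Append three hypothetical outliers far below the trend
x_out = np.append(x_demo, [9.0, 9.5, 10.0])
y_out = np.append(y_demo, [-20.0, -20.0, -20.0])
X_out = x_out.reshape(-1, 1)

# Ordinary least squares: squared loss, so large residuals dominate
ols = LinearRegression().fit(X_out, y_out)

# Huber loss: quadratic for small residuals, linear for large ones
huber = HuberRegressor().fit(X_out, y_out)

ols_slope = ols.coef_[0]
huber_slope = huber.coef_[0]
print(f"True slope:  2.000")
print(f"OLS slope:   {ols_slope:.3f}")
print(f"Huber slope: {huber_slope:.3f}")
```

Because the Huber loss grows only linearly for large residuals, the outliers pull the fitted line far less than they do under squared error.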

## Final Evaluation on Test Set

Now that we've finalized our approach (using scaled features), let's evaluate on the test set **one time only**.

In [None]:
# TODO: Transform test set using the SAME scaler fitted on training data
# TODO: Predict on test set
# TODO: Compute final RMSE and R² scores

# Your code here


print("\n" + "="*60)
print("FINAL TEST SET EVALUATION")
print("="*60)
print(f"Test RMSE: {test_rmse:.4f}")
print(f"Test R²:   {test_r2:.4f}")
print(f"\nInterpretation: The typical prediction error (RMSE) is")
print(f"{test_rmse:.2f} × $100,000 = ${test_rmse*100000:,.0f} in house value.")

## Feature Importance Analysis

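Since we standardized our features, the learned weights are on a common scale, so comparing their magnitudes gives a rough indication of feature importance. Treat the ranking with caution when features are correlated (for example, Latitude and Longitude), because the weights of correlated features depend on one another.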

In [None]:
# Extract feature weights (excluding bias)
feature_weights = model_scaled.weights_[1:]  # Skip bias term

# Create DataFrame for visualization
importance_df = pd.DataFrame({
    'Feature': housing.feature_names,
    'Weight': feature_weights
}).sort_values('Weight', key=abs, ascending=False)

# Plot
plt.figure(figsize=(10, 6))
colors = ['green' if x > 0 else 'red' for x in importance_df['Weight']]
plt.barh(importance_df['Feature'], importance_df['Weight'], color=colors, edgecolor='black')
plt.xlabel('Standardized Weight', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Importance (Standardized Weights)', fontsize=14)
plt.axvline(x=0, color='black', linewidth=0.8)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("\nFeature Importance:")
print(importance_df.to_string(index=False))
print("\nGreen = Positive weight (↑ feature → ↑ predicted house value, other features held fixed)")
print("Red   = Negative weight (↑ feature → ↓ predicted house value, other features held fixed)")

> **Question**: In the California Housing dataset, the MedInc feature has a large positive weight. This means:
>
> A. Higher median income in a block group is associated with higher house values
>
> B. Median income causes house values to increase
>
> C. Median income is the only important feature
>
> D. The model is overfitting on median income

<details><summary>Click to reveal answer</summary>

**Answer: A**

Correlation does not imply causation. A large positive weight indicates a strong positive association, but we cannot conclude causation from this alone. The model considers all features together.
</details>

## Summary and Best Practices

### Key Takeaways

1. **Linear Regression is parametric and efficient**
   - Learns fixed weights during training
   - Fast prediction (just matrix multiplication)
   - Closed-form solution via normal equation

2. **Always scale your features**
   - Fit scaler on training data ONLY
   - Transform all sets using same parameters
   - Essential for regularization and numerical stability

3. **Use proper train/validation/test splits**
   - Training: Fit the model
   - Validation: Tune hyperparameters/compare models
   - Test: Final evaluation (touch once!)

4. **Understand bias-variance tradeoff**
   - Simple models (low degree) → high bias (underfit)
   - Complex models (high degree) → high variance (overfit)
   - Find the sweet spot using validation set

5. **Analyze residuals**
   - Check for patterns in residual plots
   - Validate assumptions (normality, homoscedasticity)
   - Identify areas where model struggles

### When to Use Linear Regression

✅ **Good for:**
- Relationships that are approximately linear
- When interpretability is important
- Baseline model for comparison
- Large datasets (very efficient)

❌ **Not ideal for:**
- Highly non-linear relationships (without feature engineering)
- Data with many outliers (consider robust regression)
- When features are highly collinear (consider Ridge/Lasso)

### Linear Regression vs KNN Regression

| Aspect | Linear Regression | KNN Regression |
|--------|------------------|----------------|
| **Model Type** | Parametric (learns weights) | Non-parametric (instance-based) |
| **Training Time** | Fast (closed-form) | Instant (lazy learner) |
| **Prediction Time** | Very fast O(d) | Slower O(Nd) |
| **Memory** | Stores only weights | Stores all training data |
| **Assumes** | Linear relationship | Local similarity |
| **Interpretability** | High (can analyze weights) | Low (black box) |
| **Handles Non-linearity** | Requires feature engineering | Naturally handles it |
| **Outlier Sensitivity** | High | Medium (depends on K) |

### Best Practices Checklist

- ✅ Always split data into train/validation/test
- ✅ Standardize features (fit on train, transform all)
- ✅ Analyze residuals to validate assumptions
- ✅ Check for multicollinearity (VIF scores)
- ✅ Compare with baseline models
- ✅ Use cross-validation for robust evaluation
- ✅ Interpret feature weights (if features are scaled)
- ✅ Watch for data leakage (never fit on validation/test)
- ✅ Consider regularization (Ridge/Lasso) for many features
- ✅ Evaluate on multiple metrics (RMSE, R², MAE)
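
Two of the checklist items, cross-validation and leakage-free scaling, combine naturally in a scikit-learn pipeline. The sketch below is a minimal example on synthetic data from `make_regression` (an assumption, so that it runs without downloading anything). It uses scikit-learn's `LinearRegression` rather than the lab's custom class.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (illustrative stand-in for a real dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# The pipeline refits the scaler inside each fold, using only that fold's
# training portion, so no validation statistics leak into training
pipe = make_pipeline(StandardScaler(), LinearRegression())

scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
print("R² per fold:", scores.round(3))
print(f"Mean R²: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the spread across folds alongside the mean gives a sense of how much the score depends on the particular split.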

> **Final Question**: You trained a Linear Regression model and achieved R² = 0.85 on validation set. What does this mean?
>
> A. The model explains 85% of the variance in the target variable  
> B. The model is 85% accurate  
> C. 85% of predictions are correct  
> D. The RMSE is 0.85

<details><summary>Click to reveal answer</summary>

**Answer: A**

R² = 0.85 means the model explains 85% of the variance in the target variable. The remaining 15% is due to irreducible error (noise) or unmodeled patterns. R² is not the same as accuracy (which is for classification) or RMSE.
</details>

## Congratulations!

You've completed the Linear Regression Hands-On Lab! You now understand:

- ✅ The mathematics behind Linear Regression
- ✅ How to implement it from scratch
- ✅ The importance of feature scaling
- ✅ How to properly split and evaluate models
- ✅ Bias-variance tradeoff concepts
- ✅ Residual analysis and diagnostics
- ✅ When to use Linear Regression vs other algorithms

**Next Steps:**
- Explore regularization (Ridge and Lasso regression)
- Learn about polynomial regression for non-linear relationships
- Study logistic regression for classification tasks
- Practice with different datasets to build intuition