<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Linear%20Regression/Linear%20Regression%20Hands-On%20Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression Hands-On Lab

In this lab, you will implement a Linear Regression model from scratch, understand the mathematics behind it, and apply it to real data. Along the way, you'll answer conceptual questions and create visualizations to deepen your understanding.

**Learning Objectives:**
- Understand the mathematics of Linear Regression
- Implement a custom Linear Regression class from scratch
- Visualize regression lines and prediction errors
- Apply feature scaling and understand its importance
- Tune models using train/validation/test splits
- Compare custom implementation with scikit-learn
- Analyze model performance and errors

## Overview of Linear Regression

Linear Regression is a **parametric supervised learning algorithm** used for **regression tasks** (predicting continuous values).

**Key Idea:**
- Find a **linear function** that best fits the training data
- Model: **y = w₁x + w₀** (for single feature) or **y = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ** (for multiple features)
- **w₀** is the intercept (bias term)
- **w₁, w₂, ..., wₙ** are the slopes (weights)

**How it works:**
1. **Training:** Find optimal weights **w** using the **normal equation**: **w = (ΦᵀΦ)⁻¹Φᵀy**
2. **Prediction:** Compute **ŷ = Φw** where Φ is the design matrix with bias column
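
As a minimal sketch of the two steps above, the snippet below fits a line to four noise-free points. The toy data and variable names are assumptions for illustration, and `np.linalg.solve` is used in place of an explicit inverse; it solves the same normal equations.

```python
import numpy as np

# Toy data (assumed for illustration): y = 1 + 2x with no noise
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix Phi: a column of ones (bias) followed by the feature
Phi = np.column_stack([np.ones_like(x), x])

# Training: solve (Phi^T Phi) w = Phi^T y for w
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Prediction: y_hat = Phi w
y_hat = Phi @ w

print("weights [w0, w1]:", w)      # approximately [1, 2]
print("predictions:", y_hat)
```

Because the data lie exactly on a line, the recovered weights match the intercept and slope used to generate them.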

**Advantages:**
- Simple and interpretable
- Fast training (closed-form solution)
- Fast prediction
- Works well when relationships are linear

**Disadvantages:**
- Assumes linear relationship
- Sensitive to outliers
- Can't capture complex non-linear patterns (without feature engineering)
- Forming and inverting ΦᵀΦ costs roughly O(Nd² + d³), so it becomes expensive for very large datasets or many features

> **Question**: Linear Regression finds the best fit line by:
>
> A. Computing the closed-form solution using the normal equation w = (ΦᵀΦ)⁻¹Φᵀy
>
> B. Iteratively searching through all possible lines until convergence
>
> C. Using gradient descent to minimize the loss function over many epochs
>
> D. Randomly initializing weights and selecting the best performing set

<details><summary>Click to reveal answer</summary>

**Correct Answer: A**

**Explanation:**
- **A is TRUE**: Linear Regression with the normal equation computes the optimal weights directly using w = (ΦᵀΦ)⁻¹Φᵀy. This is a closed-form solution that finds the global optimum in one calculation, without any iterative training.
- **B is FALSE**: There's no exhaustive search through all possible lines. With infinite possible lines, this would be computationally impossible. The normal equation mathematically derives the optimal solution.
- **C is FALSE**: While gradient descent CAN be used for Linear Regression (especially with large datasets), the implementation in this lab uses the normal equation, which requires no iterations or epochs.
- **D is FALSE**: Linear Regression is deterministic, not random. The normal equation guarantees finding the exact optimal weights without any random initialization or trial-and-error.

**Key Insight**: The normal equation is what makes Linear Regression "fast to train" - it's a single matrix operation, unlike iterative optimization methods.

</details>
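
To make the contrast between options A and C concrete, here is a small sketch that fits the same toy data both ways. The data, the learning rate, and the iteration count are assumed values chosen so that gradient descent converges; they are not taken from the lab.

```python
import numpy as np

# Toy data (assumed): a noisy line on a feature that is already centered
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)
Phi = np.column_stack([np.ones_like(x), x])

# Closed-form solution from the normal equation (one linear solve)
w_closed = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Batch gradient descent on the mean squared error (many small steps)
w_gd = np.zeros(2)
learning_rate = 0.1   # assumed value; small enough to converge on this data
for _ in range(2000):
    gradient = (2.0 / len(y)) * Phi.T @ (Phi @ w_gd - y)
    w_gd -= learning_rate * gradient

print("normal equation: ", w_closed)
print("gradient descent:", w_gd)
```

Both approaches minimize the same loss, so they should agree to within numerical precision; the difference is that the closed form gets there without a loop.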

## Model Complexity: From Simple to Complex

While basic Linear Regression fits a straight line, we can create more complex models using **polynomial features**.

**Examples:**
- **Degree 1 (Linear):** y = w₀ + w₁x
- **Degree 2 (Quadratic):** y = w₀ + w₁x + w₂x²
- **Degree 3 (Cubic):** y = w₀ + w₁x + w₂x² + w₃x³

As model complexity increases:
- **Low complexity (underfit):** Model is too simple, high bias, misses patterns
- **Right complexity:** Model generalizes well, balanced bias-variance
- **High complexity (overfit):** Model is too complex, high variance, memorizes noise
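
A polynomial model is still linear in the weights; only the design matrix changes. As a small illustration (the input values are assumed), `np.vander` builds the columns x⁰, x¹, x² for a degree-2 model:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Columns are x^0, x^1, x^2: the bias column plus degree-2 polynomial features
Phi_deg2 = np.vander(x, N=3, increasing=True)
print(Phi_deg2)
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]
```

scikit-learn's `PolynomialFeatures(include_bias=False)` produces the same columns without the leading ones, since `LinearRegression` fits the intercept itself.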

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Generate sample data
np.random.seed(42)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel()**2 + X.ravel() + 2 + np.random.normal(0, 1, 50)

# Create figure with subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
degrees = [1, 2, 9]
titles = ['Underfitting (Degree 1)', 'Good Fit (Degree 2)', 'Overfitting (Degree 9)']

for ax, degree, title in zip(axes, degrees, titles):
    # Fit polynomial regression
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly = poly.fit_transform(X)
    model = LinearRegression()
    model.fit(X_poly, y)
    
    # Plot
    X_plot = np.linspace(-3, 3, 200).reshape(-1, 1)
    X_plot_poly = poly.transform(X_plot)
    y_plot = model.predict(X_plot_poly)
    
    ax.scatter(X, y, c='lightblue', s=50, edgecolors='black', label='Data')
    ax.plot(X_plot, y_plot, 'r-', linewidth=2, label=f'Degree {degree}')
    ax.set_xlabel('x', fontsize=12)
    ax.set_ylabel('y', fontsize=12)
    ax.set_title(title, fontsize=14)
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Bias-Variance Trade-off in Linear Regression

**Bias** = Error from overly simplistic assumptions
- High bias → Underfitting → Model too simple
- Example: Fitting a line to quadratic data

**Variance** = Error from sensitivity to small fluctuations in training data
- High variance → Overfitting → Model too complex
- Example: High-degree polynomial that memorizes noise

**Goal:** Find the sweet spot that minimizes **total error = bias² + variance + irreducible error**
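
One way to see the trade-off numerically is to compare training and validation MSE as the polynomial degree grows. The sketch below reuses the quadratic trend from the earlier plot but draws its own hypothetical train and validation samples; the sample sizes and helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    # Quadratic trend with unit-variance noise (hypothetical sample)
    x = rng.uniform(-3.0, 3.0, size=n)
    y = 0.5 * x**2 + x + 2.0 + rng.normal(0.0, 1.0, size=n)
    return x, y

def mean_sq_err(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

x_tr, y_tr = make_data(30)
x_va, y_va = make_data(200)

train_mse, val_mse = {}, {}
for degree in (1, 2, 9):
    # Polynomial design matrices with columns x^0 ... x^degree
    Phi_tr = np.vander(x_tr, N=degree + 1, increasing=True)
    Phi_va = np.vander(x_va, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    train_mse[degree] = mean_sq_err(y_tr, Phi_tr @ w)
    val_mse[degree] = mean_sq_err(y_va, Phi_va @ w)
    print(f"degree {degree}: train MSE = {train_mse[degree]:.3f}, "
          f"validation MSE = {val_mse[degree]:.3f}")
```

Training error can only go down as the degree increases, because each model contains the previous one as a special case. Validation error is the quantity to watch: it typically stops improving, or gets worse, once the model starts fitting noise.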

> **Question**: A linear regression model with polynomial degree 9 has very low training error (MSE = 0.05) but high validation error (MSE = 2.50). This indicates:
>
> A. High bias (underfitting) - the model is too simple
>
> B. High variance (overfitting) - the model is too complex
>
> C. Perfect fit - the model generalizes well
>
> D. The validation set is too small to evaluate properly

<details><summary>Click to reveal answer</summary>

**Correct Answer: B**

**Explanation:**
- **A is FALSE**: High bias (underfitting) would cause BOTH training and validation errors to be high. Here, training error is very low (0.05), so the model is definitely not too simple.
- **B is TRUE**: The large gap between training error (0.05) and validation error (2.50) is the classic signature of overfitting. The degree-9 polynomial has memorized the training data (including noise) but fails to generalize to new data. This is high variance - the model is too complex.
- **C is FALSE**: A well-generalizing model would have similar training and validation errors. The 50× gap (0.05 vs 2.50) shows severe overfitting, not good generalization.
- **D is FALSE**: While validation set size matters, a 50× error gap is far too large to be explained by sampling variance alone. This clearly indicates a model complexity problem.

**Key Insight**: When training error is much lower than validation error, you have overfitting (high variance). When BOTH are high, you have underfitting (high bias).

</details>

## Feature Scaling in Linear Regression

While the predictions of ordinary (unregularized) Linear Regression aren't affected by feature scaling (unlike KNN), scaling is still important because:

1. **Numerical Stability:** Computing (ΦᵀΦ)⁻¹ can be numerically unstable when features have very different scales
2. **Regularization:** If using Ridge or Lasso regression, unscaled features will be penalized unfairly
3. **Optimization:** When using gradient descent instead of the normal equation, convergence is much faster with scaled features
4. **Interpretability:** Standardized coefficients can be compared to determine feature importance

**Best Practice:** Always scale features for regression tasks!
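
The numerical-stability point can be checked directly. The following sketch uses two hypothetical features on very different scales and compares the condition number of ΦᵀΦ before and after standardization; the data and helper name are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Two hypothetical features on very different scales
x1 = rng.normal(0.0, 1.0, size=n)
x2 = rng.normal(5000.0, 1000.0, size=n)
X_raw = np.column_stack([x1, x2])
y_toy = 3.0 + 2.0 * x1 + 0.001 * x2 + rng.normal(0.0, 0.1, size=n)

# Standardize each column with its own mean and standard deviation
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def design(features):
    # Prepend the bias column of ones
    return np.column_stack([np.ones(len(features)), features])

Phi_raw, Phi_std = design(X_raw), design(X_std)

# Condition number of Phi^T Phi: larger means a less stable inversion
cond_raw = np.linalg.cond(Phi_raw.T @ Phi_raw)
cond_std = np.linalg.cond(Phi_std.T @ Phi_std)

# Least-squares fits on both versions give the same predictions
w_raw, *_ = np.linalg.lstsq(Phi_raw, y_toy, rcond=None)
w_std, *_ = np.linalg.lstsq(Phi_std, y_toy, rcond=None)
pred_raw, pred_std = Phi_raw @ w_raw, Phi_std @ w_std

print(f"cond(Phi^T Phi) unscaled: {cond_raw:.3e}")
print(f"cond(Phi^T Phi) scaled:   {cond_std:.3e}")
print("max prediction difference:", np.max(np.abs(pred_raw - pred_std)))
```

The weights differ between the two fits, but the predictions agree, which is why scaling matters for stability and interpretation rather than for the accuracy of an unregularized fit.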

## Z-Score Standardization

### The Z-Score Formula

$$z = \frac{x - \mu}{\sigma}$$

Where:
- **x** = original value
- **μ** = mean of the feature
- **σ** = standard deviation of the feature
- **z** = standardized value

### What This Does
- Centers data around 0 (mean = 0)
- Scales to unit variance (std = 1)
- Preserves the distribution shape
- Makes features comparable

### Critical Rule for Train/Validation/Test Splits
```python
from sklearn.preprocessing import StandardScaler

# 1. Fit scaler on training data ONLY
scaler = StandardScaler()
scaler.fit(X_train)  # Compute μ and σ from training data

# 2. Transform all sets using SAME parameters
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)      # Use training μ and σ
X_test_scaled = scaler.transform(X_test)    # Use training μ and σ
```

**Why?** To prevent **data leakage** - the model should not have any information about validation/test sets during training!
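
The same rule can be written out by hand with NumPy, which makes explicit what "fit" and "transform" each do. This is a sketch on a hypothetical feature matrix, not a replacement for `StandardScaler`:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical feature matrix, split 80/20 into train and validation rows
X_all = rng.normal(loc=[10.0, 500.0], scale=[2.0, 100.0], size=(100, 2))
X_tr, X_va = X_all[:80], X_all[80:]

# "fit": compute mu and sigma from the training rows only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)   # population std (ddof=0), as StandardScaler uses

# "transform": apply the same mu and sigma to every split
X_tr_scaled = (X_tr - mu) / sigma
X_va_scaled = (X_va - mu) / sigma

print("train mean:", X_tr_scaled.mean(axis=0))   # approximately 0
print("train std: ", X_tr_scaled.std(axis=0))    # approximately 1
print("val mean:  ", X_va_scaled.mean(axis=0))   # near 0, but generally not exactly 0
```

The validation statistics are not exactly 0 and 1, and that is expected: the validation set is treated as unseen data that happens to pass through the training-set transformation.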

> **Question**: After properly fitting a StandardScaler on training data and transforming all sets, what should be TRUE about the scaled training features?
>
> A. All features should have values between 0 and 1
>
> B. All features should have the same variance as the original data
>
> C. All features should have approximately mean = 0 and standard deviation = 1
>
> D. All features should be normalized to have the same range as the target variable

<details><summary>Click to reveal answer</summary>

**Correct Answer: C**

**Explanation:**
- **A is FALSE**: StandardScaler uses z-score normalization, which transforms data to have mean=0 and std=1. Values can be negative and exceed 1 (e.g., outliers might be ±3 or more). Min-max scaling (MinMaxScaler) produces [0,1] ranges, not StandardScaler.
- **B is FALSE**: The entire purpose of standardization is to CHANGE the variance to 1. The original variance could have been anything (e.g., 100, 0.01, etc.), but after scaling, all features have variance ≈ 1.
- **C is TRUE**: StandardScaler applies z = (x - μ)/σ to each feature, which centers the data at mean=0 and scales to std=1. This is exactly what z-score standardization does.
- **D is FALSE**: Feature scaling is independent of the target variable. StandardScaler only looks at the feature distributions, not the target. Features and target can have completely different scales.

**Key Insight**: StandardScaler (z-score) creates mean=0, std=1. MinMaxScaler creates values in [0,1]. They serve different purposes!

</details>

## Pseudocode for Linear Regression

### Formal Pseudocode

```
============================================
Inputs
============================================
X       ← training features (N × d matrix)
y       ← training targets (N × 1 vector)
X_query ← examples to predict

============================================
----- fit -----
============================================
1. Take X and y as inputs (they need not be stored after fitting)
2. Add bias column: Φ ← [1, X]  # (N × (d+1))
3. Compute weights using normal equation:
   w ← (ΦᵀΦ)⁻¹Φᵀy
4. Store w

============================================
----- predict -----
============================================
For each query point in X_query:
1. Add bias: Φ_query ← [1, X_query]
2. Compute prediction: ŷ ← Φ_query · w
3. Return ŷ
```

### Key Observations
- **No iterations needed:** One-step solution via normal equation
- **Fast prediction:** Just matrix multiplication
- **Memory efficient:** Only stores weights (not all training data like KNN)
- **Global model:** Learns one function for entire space (unlike local KNN)
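
The pseudocode inverts ΦᵀΦ explicitly, which fails or becomes unreliable when ΦᵀΦ is singular, for example when a feature is duplicated. As a hedged alternative sketch, `np.linalg.lstsq` solves the same least-squares problem without forming the inverse. The helper names below are hypothetical and are not part of the lab's `MyLinearRegressor` scaffold.

```python
import numpy as np

def add_bias(X):
    """Prepend a column of ones to X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(X.shape[0]), X])

def fit_least_squares(X, y):
    """Return weights minimizing ||Phi w - y||^2 (hypothetical helper)."""
    w, *_ = np.linalg.lstsq(add_bias(X), np.asarray(y, dtype=float), rcond=None)
    return w

def predict_linear(X, w):
    """Compute y_hat = Phi w (hypothetical helper)."""
    return add_bias(X) @ w

# A duplicated feature makes Phi^T Phi singular, so np.linalg.inv is unreliable here
x_col = np.array([0.0, 1.0, 2.0, 3.0])
X_dup = np.column_stack([x_col, x_col])
y_dup = 1.0 + 2.0 * x_col

w_dup = fit_least_squares(X_dup, y_dup)
print("weights:", w_dup)
print("predictions:", predict_linear(X_dup, w_dup))
```

With a duplicated column the individual weights are not unique, but their sum and the resulting predictions are, so the fit still reproduces the targets.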

## Implementing a Custom Linear Regression Class

Below is a scaffold of the `MyLinearRegressor` class. Fill in the TODO sections to complete the implementation:

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MyLinearRegressor(BaseEstimator, RegressorMixin):
    """
    Custom Linear Regression implementation using the normal equation.
    
    Parameters:
    -----------
    None
    
    Attributes:
    -----------
    weights_ : array of shape (n_features + 1,)
        Learned weights including bias term
    """
    
    def __init__(self):
        pass
    
    def fit(self, X, y):
        """
        Fit the linear regression model using the normal equation.
        
        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            Training data
        y : array-like of shape (n_samples,)
            Target values
        
        Returns:
        --------
        self
        """
        # TODO: Add column of ones for bias term
        # Hint: Use np.c_[np.ones(len(X)), X] to create design matrix Phi
        Phi = ___
        
        # TODO: Compute weights using normal equation: w = (Phi^T Phi)^{-1} Phi^T y
        # Hint: Use @ for matrix multiplication, .T for transpose, np.linalg.inv() for inverse
        self.weights_ = ___
        
        return self
    
    def predict(self, X):
        """
        Predict using the linear model.
        
        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            Samples to predict
        
        Returns:
        --------
        y_pred : array of shape (n_samples,)
            Predicted values
        """
        # TODO: Add column of ones for bias term
        Phi = ___
        
        # TODO: Compute predictions: y_pred = Phi @ weights
        y_pred = ___
        
        return y_pred

### Test Your Implementation

Once you have filled in the implementation, test the custom regressor on a simple dataset to confirm it works as expected.

In [None]:
# Create simple test data
np.random.seed(42)
X_simple = np.array([[1], [2], [3], [4], [5]])
y_simple = np.array([2, 4, 6, 8, 10])  # Perfect linear relationship: y = 2x

# Fit model
model = MyLinearRegressor()
model.fit(X_simple, y_simple)

# Make predictions
predictions = model.predict(X_simple)

print("Learned weights (w0=intercept, w1=slope):", model.weights_)
print("Expected: [0, 2] or very close to it")
print("\nPredictions:", predictions)
print("Actual:     ", y_simple)
print("\nMean Squared Error:", np.mean((predictions - y_simple)**2))
print("Expected: ~0 (perfect fit)")

> **Question**: Unlike algorithms such as KNN, Linear Regression has very fast prediction time even with large datasets. Why?
>
> A. Linear Regression learns a parametric model (weights) during training. Prediction is just a simple matrix multiplication, regardless of training set size.
>
> B. Linear Regression stores a compressed version of the training data that requires less computation
>
> C. Linear Regression uses the normal equation which caches distances to training points
>
> D. Linear Regression only uses the K most important training examples for each prediction

<details><summary>Click to reveal answer</summary>

**Correct Answer: A**

**Explanation:**
- **A is TRUE**: Linear Regression is a parametric model that learns fixed weights w during training. At prediction time, it computes ŷ = Φw (matrix multiplication), which costs O(d) per query point, where d is the number of features. The training set size N is irrelevant at prediction time because the model doesn't reference training data, only the learned weights.
- **B is FALSE**: Linear Regression doesn't store any training data (compressed or otherwise) after training. It only stores the weight vector w, which has size (d+1) regardless of how many training examples (N) were used.
- **C is FALSE**: The normal equation w = (ΦᵀΦ)⁻¹Φᵀy is only used during training to compute weights. It doesn't cache distances or store anything for prediction time. Linear Regression doesn't use distance computations at all.
- **D is FALSE**: This describes KNN behavior, not Linear Regression. Linear Regression uses ALL training data to learn weights during training, but uses NO training examples during prediction - it only uses the learned weights.

**Key Insight**: Parametric models (Linear Regression) learn fixed parameters → O(d) per prediction. Non-parametric models (KNN) reference training data → O(Nd) per prediction with a brute-force search.

</details>

## A Dataset for Visualization

Let's work with the same synthetic dataset from the Code Walk Through to visualize how Linear Regression works.

In [None]:
# Generate the same data as in Code Walk Through
np.random.seed(42)
X_train = np.arange(-9.5, 8.5, 0.1).reshape(-1, 1)
y_train = X_train.ravel() + 1 + np.random.normal(0, 2, len(X_train))

print(f"Training data: {len(X_train)} points")
print(f"X range: [{X_train.min():.1f}, {X_train.max():.1f}]")
print(f"y range: [{y_train.min():.1f}, {y_train.max():.1f}]")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, c='lightblue', alpha=0.6, edgecolors='black', linewidths=0.5)
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Training Data: Linear Relationship with Noise', fontsize=16)
plt.grid(True, alpha=0.3)
plt.show()

## Training and Visualizing the Model

In [None]:
# TODO: Fit your MyLinearRegressor on the training data
# Hint: model = MyLinearRegressor()
#       model.fit(X_train, y_train)

model = ___
model.fit(___, ___)

print(f"Learned weights: {model.weights_}")
print(f"Model equation: y = {model.weights_[1]:.3f}x + {model.weights_[0]:.3f}")

In [None]:
# Visualize the fit
x_line = np.linspace(X_train.min(), X_train.max(), 100).reshape(-1, 1)
y_line = model.predict(x_line)

plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, c='lightblue', alpha=0.6, edgecolors='black', linewidths=0.5, label='Training data')
plt.plot(x_line, y_line, 'r-', linewidth=2, label=f'Best fit: y={model.weights_[1]:.2f}x+{model.weights_[0]:.2f}')
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Linear Regression: Best Fit Line', fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

## Understanding Predictions: Step-by-Step

Let's visualize how the model makes a prediction for a single test point.

In [None]:
# TODO: Make a prediction for x = 5.0
# Hint: X_test = np.array([[5.0]])
#       y_pred = model.predict(X_test)

X_test = np.array([[___]])
y_pred = model.predict(___)

print(f"For x = {X_test[0, 0]:.1f}:")
print(f"Predicted y = {y_pred[0]:.3f}")
print(f"Calculation: y = {model.weights_[1]:.3f} × {X_test[0, 0]:.1f} + {model.weights_[0]:.3f} = {y_pred[0]:.3f}")

In [None]:
# Visualize the prediction
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, c='lightblue', alpha=0.6, label='Training data')
plt.plot(x_line, y_line, 'k-', linewidth=2, label='Best fit line')
plt.scatter(X_test, y_pred, c='red', s=200, marker='*', edgecolors='black', linewidths=2, 
           label=f'Prediction: x={X_test[0,0]:.1f}, ŷ={y_pred[0]:.2f}', zorder=5)
plt.plot([X_train.min(), X_test[0,0]], [y_pred[0], y_pred[0]], 'r--', alpha=0.5, linewidth=1)
plt.plot([X_test[0,0], X_test[0,0]], [y_train.min(), y_pred[0]], 'r--', alpha=0.5, linewidth=1)
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Linear Regression: Making a Prediction', fontsize=16)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

## Regression Metrics: MSE and RMSE

### Mean Squared Error (MSE)

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

- Measures average squared difference between actual and predicted values
- Units are squared (e.g., if y is in dollars, MSE is in dollars²)
- Heavily penalizes large errors (due to squaring)

### Root Mean Squared Error (RMSE)

$$\text{RMSE} = \sqrt{\text{MSE}}$$

- Same units as the target variable
- Easier to interpret (e.g., "a typical error of about $5,000")
- Commonly used for regression evaluation

### R² Score (Coefficient of Determination)

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

- Ranges from -∞ to 1 (1 is perfect, 0 means model is no better than predicting mean)
- Represents proportion of variance explained by the model
- Unitless, so it does not depend on the scale of the target (compare across datasets with care, since R² also depends on how much the target varies)
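
The three formulas can be checked by hand on a tiny example. The values below are made up for illustration, and the names are chosen so they do not clash with variables used elsewhere in the lab.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

toy_mse = np.mean((y_true - y_hat) ** 2)          # (1 + 0 + 1 + 0) / 4 = 0.5
toy_rmse = np.sqrt(toy_mse)                        # sqrt(0.5), about 0.707
ss_res = np.sum((y_true - y_hat) ** 2)             # 2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # 9 + 1 + 1 + 9 = 20
toy_r2 = 1.0 - ss_res / ss_tot                     # 1 - 2/20 = 0.9

print(f"MSE = {toy_mse:.3f}, RMSE = {toy_rmse:.3f}, R² = {toy_r2:.3f}")
```

These hand computations use the same definitions as `mean_squared_error` and `r2_score` from scikit-learn.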

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Compute metrics on training data
y_train_pred = model.predict(X_train)
mse = mean_squared_error(y_train, y_train_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_train, y_train_pred)

print(f"Training Metrics:")
print(f"  MSE:  {mse:.3f}")
print(f"  RMSE: {rmse:.3f}")
print(f"  R²:   {r2:.3f}")
print(f"\nInterpretation: The model's predictions are typically about {rmse:.2f} units away from actual values.")

## Working with a Real Dataset: California Housing

Now let's apply Linear Regression to a real-world dataset. We'll use the California Housing dataset, which contains information about California districts and median house values.

**Dataset Features:**
- MedInc: Median income in block group
- HouseAge: Median house age in block group  
- AveRooms: Average number of rooms per household
- AveBedrms: Average number of bedrooms per household
- Population: Block group population
- AveOccup: Average number of household members
- Latitude: Block group latitude
- Longitude: Block group longitude

**Target:** Median house value (in $100,000s)

In [None]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Create DataFrame for better visualization
df = pd.DataFrame(X, columns=housing.feature_names)
df['MedHouseVal'] = y

print(f"Dataset shape: {X.shape}")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"\nFirst few rows:")
print(df.head())
print(f"\nDataset statistics:")
print(df.describe())

Notice how features have **very different scales:**
- MedInc ranges from ~0.5 to ~15
- Population ranges from ~3 to ~35,000+
- Latitude/Longitude are coordinates

This is why feature scaling is important!

## Splitting into Train, Validation, and Test Sets

**Why 3 splits?**
- **Training set (60%):** Fit the model
- **Validation set (20%):** Tune hyperparameters / compare models
- **Test set (20%):** Final evaluation (touch only once!)

**Important:** Test set simulates real-world unseen data. Never use it for model selection!
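
To illustrate what a 60/20/20 split means at the index level, here is a sketch using a shuffled index array. The dataset size is hypothetical, and this is only one way to split; the exercise in this lab uses `train_test_split`.

```python
import numpy as np

n_samples = 100                        # hypothetical dataset size
rng = np.random.default_rng(42)
shuffled = rng.permutation(n_samples)  # random order of row indices

n_train = n_samples * 60 // 100        # 60
n_val = n_samples * 20 // 100          # 20

train_idx = shuffled[:n_train]
val_idx = shuffled[n_train:n_train + n_val]
test_idx = shuffled[n_train + n_val:]  # the remaining 20

print(len(train_idx), len(val_idx), len(test_idx))
```

Each row index lands in exactly one split, so no sample is shared between training, validation, and test data.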

In [None]:
from sklearn.model_selection import train_test_split

# TODO: Split into Train (60%), Validation (20%), Test (20%)
# Hint: First split into Train (60%) and Temp (40%)
#       Then split Temp into equal halves for Validation and Test
#       Use random_state=42 for reproducibility

# Step 1: Split into Train (60%) and Temp (40%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=___, random_state=___)

# Step 2: Split Temp into Validation (50% of 40% = 20%) and Test (50% of 40% = 20%)
X_val, X_test, y_val, y_test = train_test_split(___, ___, test_size=___, random_state=___)

print(f"Training set:   {len(X_train)} samples ({len(X_train)/len(X)*100:.0f}%)")
print(f"Validation set: {len(X_val)} samples ({len(X_val)/len(X)*100:.0f}%)")
print(f"Test set:       {len(X_test)} samples ({len(X_test)/len(X)*100:.0f}%)")

## Feature Scaling: Comparing Unscaled vs Scaled

### 1. Linear Regression on Unscaled Features

In [None]:
# TODO: Fit MyLinearRegressor on unscaled training data
# TODO: Predict on validation set
# TODO: Compute RMSE and R² scores

model_unscaled = ___
model_unscaled.fit(___, ___)

y_val_pred_unscaled = model_unscaled.predict(___)

rmse_unscaled = np.sqrt(mean_squared_error(___, ___))
r2_unscaled = r2_score(___, ___)

print("Performance on UNSCALED features:")
print(f"  Validation RMSE: {rmse_unscaled:.4f}")
print(f"  Validation R²:   {r2_unscaled:.4f}")

### 2. Linear Regression on Scaled Features

In [None]:
from sklearn.preprocessing import StandardScaler

# TODO: Create and fit StandardScaler on training data ONLY
# TODO: Transform train and validation sets
# Hint: scaler = StandardScaler()
#       scaler.fit(X_train)
#       X_train_scaled = scaler.transform(X_train)
#       X_val_scaled = scaler.transform(X_val)

scaler = ___
scaler.fit(___)

X_train_scaled = scaler.transform(___)
X_val_scaled = scaler.transform(___)

print("Scaled training data statistics:")
print(f"  Mean: {X_train_scaled.mean(axis=0)}")
print(f"  Std:  {X_train_scaled.std(axis=0)}")

In [None]:
# TODO: Fit MyLinearRegressor on scaled training data
# TODO: Predict on scaled validation set
# TODO: Compute RMSE and R² scores

model_scaled = ___
model_scaled.fit(___, ___)

y_val_pred_scaled = model_scaled.predict(___)

rmse_scaled = np.sqrt(mean_squared_error(___, ___))
r2_scaled = r2_score(___, ___)

print("Performance on SCALED features:")
print(f"  Validation RMSE: {rmse_scaled:.4f}")
print(f"  Validation R²:   {r2_scaled:.4f}")

### Comparison

In [None]:
print("\n" + "="*50)
print("COMPARISON: Unscaled vs Scaled Features")
print("="*50)
print(f"Unscaled - RMSE: {rmse_unscaled:.4f} | R²: {r2_unscaled:.4f}")
print(f"Scaled   - RMSE: {rmse_scaled:.4f} | R²: {r2_scaled:.4f}")
print("\nNote: For Linear Regression with normal equation, scaling doesn't change predictions.")
print("However, it improves numerical stability and is essential for regularization!")

> **Question**: You're evaluating a Linear Regression model and want to ensure your performance estimates are unbiased and reflect real-world generalization. Which practice is MOST important?
>
> A. Using the largest possible training set to maximize model performance
>
> B. Tuning hyperparameters directly on the test set to find the best configuration
>
> C. Keeping the test set completely unseen until final evaluation, using a separate validation set for model selection
>
> D. Fitting the StandardScaler on all data (train + validation + test) before splitting

<details><summary>Click to reveal answer</summary>

**Correct Answer: C**

**Explanation:**
- **A is FALSE**: While a larger training set generally helps, this alone doesn't ensure unbiased evaluation. If you don't properly separate validation/test sets or if you leak information, performance estimates will be overly optimistic regardless of training set size.
- **B is FALSE**: This is a critical error that causes data leakage! If you tune hyperparameters on the test set, you're indirectly "training" on it. Your test performance will be overly optimistic and won't reflect true generalization to new data. This defeats the purpose of having a test set.
- **C is TRUE**: The test set must remain completely unseen until final evaluation to provide an unbiased estimate of real-world performance. Use the validation set for hyperparameter tuning and model selection, reserve the test set for the final evaluation only once.
- **D is FALSE**: This causes data leakage! Fitting the scaler on test data means your model has "seen" test set statistics (mean, std). Always fit the scaler ONLY on training data, then transform validation/test sets using those learned parameters.

**Key Insight**: Proper train/validation/test separation is crucial. Any information from validation/test sets that influences training or model selection will inflate performance estimates.

</details>

## Comparing with Scikit-Learn's Implementation

Let's verify our implementation matches scikit-learn's LinearRegression.

In [None]:
from sklearn.linear_model import LinearRegression as SklearnLR

# Fit sklearn model
sklearn_model = SklearnLR()
sklearn_model.fit(X_train_scaled, y_train)
y_val_pred_sklearn = sklearn_model.predict(X_val_scaled)

# Compute metrics
rmse_sklearn = np.sqrt(mean_squared_error(y_val, y_val_pred_sklearn))
r2_sklearn = r2_score(y_val, y_val_pred_sklearn)

print("\n" + "="*60)
print("COMPARISON: Custom Implementation vs Scikit-Learn")
print("="*60)
print(f"Custom MyLinearRegressor - RMSE: {rmse_scaled:.4f} | R²: {r2_scaled:.4f}")
print(f"Sklearn LinearRegression - RMSE: {rmse_sklearn:.4f} | R²: {r2_sklearn:.4f}")
print(f"\nDifference in RMSE: {abs(rmse_scaled - rmse_sklearn):.6f}")
print(f"Difference in R²:   {abs(r2_scaled - r2_sklearn):.6f}")
print("\n✓ Results should match (within numerical precision)!")

## Visualizing Predictions vs Actual Values

In [None]:
# Create scatter plot of predictions vs actual
plt.figure(figsize=(10, 6))
plt.scatter(y_val, y_val_pred_scaled, alpha=0.5, edgecolors='black', linewidths=0.5)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'r--', linewidth=2, label='Perfect predictions')
plt.xlabel('Actual Values', fontsize=14)
plt.ylabel('Predicted Values', fontsize=14)
plt.title(f'Predictions vs Actual (R² = {r2_scaled:.3f})', fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.show()

print("Points close to the red line indicate accurate predictions.")
print("Scatter away from the line shows prediction errors.")

## Understanding Prediction Errors: Residual Analysis

In [None]:
# Compute residuals (errors)
residuals = y_val - y_val_pred_scaled

# Create residual plot
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Residuals vs Predicted Values
axes[0].scatter(y_val_pred_scaled, residuals, alpha=0.5, edgecolors='black', linewidths=0.5)
axes[0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0].set_xlabel('Predicted Values', fontsize=12)
axes[0].set_ylabel('Residuals (Actual - Predicted)', fontsize=12)
axes[0].set_title('Residual Plot', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Plot 2: Histogram of Residuals
axes[1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='r', linestyle='--', linewidth=2)
axes[1].set_xlabel('Residuals', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Distribution of Residuals', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Good residual plots should show:")
print("  1. Residuals randomly scattered around 0 (no patterns)")
print("  2. Approximately normal distribution")
print("  3. Constant variance across all predicted values (homoscedasticity)")

## Impact of Outliers on Linear Regression

Linear Regression is **sensitive to outliers** because it uses the **sum of squared errors** as its objective function. Outliers can significantly distort the regression line by pulling it towards themselves, leading to:

- **Biased coefficient estimates**: The model tries to minimize error for all points, including outliers
- **Reduced model accuracy**: The regression line may not represent the true relationship for most data points
- **Poor generalization**: The model fits extreme points rather than the overall trend

**Key Principle**: Our ML model should perform well in most cases. Therefore, outliers should often be detected and handled appropriately (removed, capped, or modeled separately) before training.

Let's demonstrate the impact of outliers with a visual example:

In [None]:
# Generate clean data with a linear relationship
np.random.seed(42)
X_clean = np.linspace(0, 10, 50).reshape(-1, 1)
y_clean = 2 * X_clean.ravel() + 1 + np.random.normal(0, 1, 50)

# Add an outlier
X_with_outlier = np.vstack([X_clean, [[8.0]]])
y_with_outlier = np.append(y_clean, [5.0])  # This point is far below the trend

# Fit models
model_clean = MyLinearRegressor()
model_clean.fit(X_clean, y_clean)

model_with_outlier = MyLinearRegressor()
model_with_outlier.fit(X_with_outlier, y_with_outlier)

# Create predictions for plotting
X_plot = np.linspace(0, 10, 100).reshape(-1, 1)
y_plot_clean = model_clean.predict(X_plot)
y_plot_with_outlier = model_with_outlier.predict(X_plot)

# Visualize the impact
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Plot without outlier
ax1.scatter(X_clean, y_clean, alpha=0.6, edgecolors='black', linewidths=0.5, label='Training data')
ax1.plot(X_plot, y_plot_clean, 'r-', linewidth=2, 
         label=f'Fit: y={model_clean.weights_[1]:.2f}x+{model_clean.weights_[0]:.2f}')
ax1.set_xlabel('X', fontsize=12)
ax1.set_ylabel('y', fontsize=12)
ax1.set_title('Linear Regression WITHOUT Outlier', fontsize=14)
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Plot with outlier
ax2.scatter(X_clean, y_clean, alpha=0.6, edgecolors='black', linewidths=0.5, label='Training data')
ax2.scatter([8.0], [5.0], color='red', s=200, marker='*', edgecolors='black', 
           linewidths=2, label='Outlier', zorder=5)
ax2.plot(X_plot, y_plot_clean, 'g--', linewidth=2, alpha=0.5, label='Original fit (without outlier)')
ax2.plot(X_plot, y_plot_with_outlier, 'r-', linewidth=2, 
         label=f'Fit WITH outlier: y={model_with_outlier.weights_[1]:.2f}x{model_with_outlier.weights_[0]:+.2f}')  # :+ prints the intercept's sign
ax2.set_xlabel('X', fontsize=12)
ax2.set_ylabel('y', fontsize=12)
ax2.set_title('Linear Regression WITH Outlier', fontsize=14)
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nImpact of the outlier:")
print(f"  Slope changed from {model_clean.weights_[1]:.3f} to {model_with_outlier.weights_[1]:.3f}")
print(f"  Intercept changed from {model_clean.weights_[0]:.3f} to {model_with_outlier.weights_[0]:.3f}")
print("\nThe regression line is 'pulled' toward the outlier, which shifts predictions across the whole input range!")

> **Question**: You notice one data point in your training set has an extremely high target value compared to similar inputs. What is the likely impact on your linear regression model?
>
> A. No impact - linear regression automatically detects and ignores outliers during training
>
> B. The regression line will be pulled toward the outlier, potentially degrading predictions for typical data points
>
> C. The model will generalize better because it learns to handle extreme cases
>
> D. Only the intercept will be affected, not the slope coefficients

<details><summary>Click to reveal answer</summary>

**Correct Answer: B**

**Explanation:**
- **A is FALSE**: Linear regression has NO built-in outlier detection or robust loss function. The normal equation w = (ΦᵀΦ)⁻¹Φᵀy treats all points equally in the mathematical derivation. Unlike some algorithms (e.g., RANSAC or Huber regression), standard Linear Regression doesn't identify or downweight outliers.
- **B is TRUE**: Linear regression minimizes the sum of squared errors: Σ(yᵢ - ŷᵢ)². Because errors are squared, outliers with large residuals contribute disproportionately to the loss (e.g., error of 10 contributes 100 to the loss vs. error of 1 contributing 1). The model shifts the regression line to reduce the outlier's massive error, which worsens predictions for the majority of normal points.
- **C is FALSE**: Outliers are often noise or data quality issues rather than informative extreme cases. When the model "learns" from such a point it is fitting noise, which hurts generalization. If the outlier comes from a measurement error or a data entry mistake, the model is fitting invalid data.
- **D is FALSE**: Both intercept (w₀) and slope coefficients (w₁, w₂, ...) are affected by outliers. The entire regression hyperplane can rotate and shift. The normal equation computes all weights simultaneously, and an outlier influences the entire (ΦᵀΦ)⁻¹Φᵀy calculation.

**Key Insight**: Linear Regression's squared loss makes it highly sensitive to outliers. Use robust regression methods (Huber, RANSAC) or remove outliers before training for better performance on typical data.

</details>
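
One of the handling options mentioned above, detecting and removing outliers, can be sketched with a residual-based rule. This is a minimal illustration under stated assumptions, not the lab's prescribed method: the helper `fit_line`, the synthetic data, the outlier at (9.0, -10.0), and the 1.5 × IQR fence are all choices made for this example.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = w0 + w1*x; returns the array [w0, w1]."""
    Phi = np.column_stack([np.ones_like(x), x])  # design matrix with bias column
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 1, 50)

# Hypothetical outlier far below the trend (illustrative values)
x_out = np.append(x, 9.0)
y_out = np.append(y, -10.0)

w_clean = fit_line(x, y)        # reference fit without the outlier
w_out = fit_line(x_out, y_out)  # fit contaminated by the outlier

# Flag points whose residual lies outside the 1.5 * IQR fences
residuals = y_out - (w_out[0] + w_out[1] * x_out)
q1, q3 = np.percentile(residuals, [25, 75])
iqr = q3 - q1
is_outlier = (residuals < q1 - 1.5 * iqr) | (residuals > q3 + 1.5 * iqr)

# Refit on the points that were not flagged
w_refit = fit_line(x_out[~is_outlier], y_out[~is_outlier])

print(f"Slope without outlier:   {w_clean[1]:.3f}")
print(f"Slope with outlier:      {w_out[1]:.3f}")
print(f"Slope after IQR removal: {w_refit[1]:.3f}")
print(f"Points flagged: {is_outlier.sum()}")
```

A residual-based rule can also flag a few legitimate points, so it is worth inspecting what was flagged before dropping anything.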

## Final Evaluation on Test Set

Now that we've finalized our approach (using scaled features), let's evaluate on the test set **one time only**.

In [None]:
# TODO: Transform test set using the SAME scaler fitted on training data
# TODO: Predict on test set
# TODO: Compute final RMSE and R² scores

X_test_scaled = scaler.transform(___)

y_test_pred = model_scaled.predict(___)

test_rmse = np.sqrt(mean_squared_error(___, ___))
test_r2 = r2_score(___, ___)

print("\n" + "="*60)
print("FINAL TEST SET EVALUATION")
print("="*60)
print(f"Test RMSE: {test_rmse:.4f}")
print(f"Test R²:   {test_r2:.4f}")
print("\nInterpretation: the typical (root-mean-square) prediction error is")
print(f"{test_rmse:.2f} × $100,000 = ${test_rmse*100000:,.0f} in house value.")

## Feature Importance Analysis

Since we standardized our features, we can compare the magnitude of learned weights to understand feature importance.

In [None]:
# Extract feature weights (excluding bias)
feature_weights = model_scaled.weights_[1:]  # Skip bias term

# Create DataFrame for visualization
importance_df = pd.DataFrame({
    'Feature': housing.feature_names,
    'Weight': feature_weights
}).sort_values('Weight', key=abs, ascending=False)

# Plot
plt.figure(figsize=(10, 6))
colors = ['green' if x > 0 else 'red' for x in importance_df['Weight']]
plt.barh(importance_df['Feature'], importance_df['Weight'], color=colors, edgecolor='black')
plt.xlabel('Standardized Weight', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Importance (Standardized Weights)', fontsize=14)
plt.axvline(x=0, color='black', linewidth=0.8)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("\nFeature Importance:")
print(importance_df.to_string(index=False))
print("\nGreen = Positive weight (↑ feature → ↑ predicted house value, other features held fixed)")
print("Red   = Negative weight (↑ feature → ↓ predicted house value, other features held fixed)")

> **Question**: In the California Housing dataset with standardized features, you observe these weights: MedInc: +0.82, Latitude: -0.15, HouseAge: +0.05. What is the MOST appropriate interpretation?
>
> A. Median income causes house values to increase by $82,000
>
> B. Latitude has minimal impact because the weight is negative
>
> C. HouseAge should be removed from the model because its weight is small
>
> D. Median income has the strongest association with house values among these three features

<details><summary>Click to reveal answer</summary>

**Correct Answer: D**

**Explanation:**
- **A is FALSE**: Correlation ≠ causation. A weight indicates association, not causation. Also, with standardized features, a weight represents the change in the target per 1 standard deviation change in the feature, not per unit change. You cannot convert the weight directly to a dollar amount without accounting for the feature's scale, and even then it would not be a causal effect.
- **B is FALSE**: Impact is judged by magnitude, not sign. |−0.15| > |+0.05|, so Latitude has more impact than HouseAge. The negative sign just means an inverse relationship: with the other features held fixed, house values tend to decrease as Latitude increases (moving north in California), which is plausible when comparing, for example, coastal Southern California with rural northern areas.
- **C is FALSE**: Small weight doesn't automatically mean the feature should be removed. HouseAge might still provide value and removing it could hurt generalization. Feature selection should be based on validation performance, not just weight magnitude. Also, even small weights can be statistically significant.
- **D is TRUE**: With standardized features, weight magnitudes are directly comparable. |0.82| > |−0.15| > |0.05| indicates MedInc has the strongest linear association with house values. When features are on the same scale (mean=0, std=1), larger absolute weights indicate stronger relationships.

**Key Insight**: Standardized weights allow direct comparison of feature importance. But remember: large weight = strong association, not causation!

</details>
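
The "per 1 standard deviation" reading of standardized weights can be checked numerically. The sketch below uses synthetic data with two hypothetical features on very different scales (`x_small` and `x_large` are made-up names, not columns of the housing data) and a plain least-squares fit via `np.linalg.lstsq`. It shows that each standardized weight equals the raw weight multiplied by that feature's standard deviation.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares weights [w0, w1, ..., wd] with a bias column prepended."""
    Phi = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

rng = np.random.default_rng(0)
n = 500
x_small = rng.normal(4.0, 2.0, n)       # hypothetical feature, small scale
x_large = rng.normal(1000.0, 300.0, n)  # hypothetical feature, large scale
X = np.column_stack([x_small, x_large])
y = 0.4 * x_small + 0.001 * x_large + rng.normal(0, 0.1, n)

# Standardize with the column means and standard deviations
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma

w_raw = fit_ols(X, y)      # weights per raw unit of each feature
w_std = fit_ols(X_std, y)  # weights per standard deviation of each feature

print("Raw weights:          ", w_raw[1:])
print("Standardized weights: ", w_std[1:])
print("Raw weights * sigma:  ", w_raw[1:] * sigma)
```

The raw weights differ by orders of magnitude only because of the units, while the standardized weights are on a common footing. With standardized features the intercept equals the mean of `y`.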

## Summary and Best Practices

### Key Takeaways

1. **Linear Regression is parametric and efficient**
   - Learns fixed weights during training
   - Fast prediction (just matrix multiplication)
   - Closed-form solution via normal equation

2. **Always scale your features**
   - Fit scaler on training data ONLY
   - Transform all sets using same parameters
   - Important for comparable weights, regularization, and numerical stability

3. **Use proper train/validation/test splits**
   - Training: Fit the model
   - Validation: Tune hyperparameters/compare models
   - Test: Final evaluation (touch once!)

4. **Understand bias-variance tradeoff**
   - Simple models (low degree) → high bias (underfit)
   - Complex models (high degree) → high variance (overfit)
   - Find the sweet spot using validation set

5. **Analyze residuals**
   - Check for patterns in residual plots
   - Validate assumptions (normality, homoscedasticity)
   - Identify areas where model struggles
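
Takeaways 2 and 3 can be sketched together in a few lines of NumPy. This is an illustrative outline on synthetic data, not the lab's own solution: the 60/20/20 split, the helper names `scale`, `add_bias`, and `rmse`, and the data-generating coefficients are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
# Two synthetic features on different scales (illustrative only)
X = rng.normal(loc=[50.0, 0.5], scale=[10.0, 0.1], size=(n, 2))
y = 0.3 * X[:, 0] - 4.0 * X[:, 1] + rng.normal(0, 1.0, n)

# Shuffle once, then split 60% / 20% / 20%
idx = rng.permutation(n)
train, val, test = idx[:180], idx[180:240], idx[240:]

# Scaling parameters come from the TRAINING rows only
mu = X[train].mean(axis=0)
sigma = X[train].std(axis=0)

def scale(A):
    """Apply the training-set mean and std to any split."""
    return (A - mu) / sigma

def add_bias(A):
    """Prepend a column of ones for the intercept."""
    return np.column_stack([np.ones(len(A)), A])

# Fit on the training split only
w, *_ = np.linalg.lstsq(add_bias(scale(X[train])), y[train], rcond=None)

def rmse(A, target):
    """Root-mean-square error of the fitted model on a given split."""
    pred = add_bias(scale(A)) @ w
    return np.sqrt(np.mean((target - pred) ** 2))

val_rmse = rmse(X[val], y[val])     # use this to compare candidate models
test_rmse = rmse(X[test], y[test])  # report once, after all choices are made

print(f"Validation RMSE: {val_rmse:.3f}")
print(f"Test RMSE:       {test_rmse:.3f}")
```

Because `mu` and `sigma` are computed from the training rows alone, the scaled validation and test features do not have exactly zero mean, which is expected and not a bug.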

### When to Use Linear Regression

✅ **Good for:**
- Relationships that are approximately linear
- When interpretability is important
- Baseline model for comparison
- Large datasets with a moderate number of features (training and prediction are efficient)

❌ **Not ideal for:**
- Highly non-linear relationships (without feature engineering)
- Data with many outliers (consider robust regression)
- When features are highly collinear (consider Ridge/Lasso)

### Linear Regression vs KNN Regression

| Aspect | Linear Regression | KNN Regression |
|--------|------------------|----------------|
| **Model Type** | Parametric (learns weights) | Non-parametric (instance-based) |
| **Training Time** | Fast (closed-form) | Instant (lazy learner) |
| **Prediction Time** | Very fast, O(d) per query | Slower, O(Nd) per query (brute force) |
| **Memory** | Stores only weights | Stores all training data |
| **Assumes** | Linear relationship | Local similarity |
| **Interpretability** | High (can analyze weights) | Low (black box) |
| **Handles Non-linearity** | Requires feature engineering | Naturally handles it |
| **Outlier Sensitivity** | High | Medium (depends on K) |
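
The memory and prediction-time rows of the table can be made concrete with a small sketch. It assumes synthetic data and a brute-force KNN with Euclidean distance and `k = 5`. These are choices for this example, not a reference implementation of either method.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, k = 1000, 3, 5
X_train = rng.normal(size=(N, d))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, N)
x_query = np.zeros(d)

# Linear regression: training compresses the data into d + 1 weights
Phi = np.column_stack([np.ones(N), X_train])
w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
pred_linear = w[0] + x_query @ w[1:]  # one dot product: O(d) per query

# KNN regression: keeps all N training rows and scans them for every query
dists = np.linalg.norm(X_train - x_query, axis=1)  # N distances: O(N*d)
nearest = np.argsort(dists)[:k]                    # indices of the k closest rows
pred_knn = y_train[nearest].mean()                 # average their targets

print(f"Linear model stores {w.size} numbers, KNN stores {X_train.size + y_train.size}")
print(f"Linear prediction: {pred_linear:.3f}")
print(f"KNN prediction:    {pred_knn:.3f}")
```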

### Best Practices Checklist

- ✅ Always split data into train/validation/test
- ✅ Standardize features (fit on train, transform all)
- ✅ Analyze residuals to validate assumptions
- ✅ Check for multicollinearity (VIF scores)
- ✅ Compare with baseline models
- ✅ Use cross-validation for robust evaluation
- ✅ Interpret feature weights (if features are scaled)
- ✅ Watch for data leakage (never fit on validation/test)
- ✅ Consider regularization (Ridge/Lasso) for many features
- ✅ Evaluate on multiple metrics (RMSE, R², MAE)
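
The multicollinearity check in the list above can be sketched from the definition VIF_j = 1 / (1 − R_j²), where R_j² is obtained by regressing feature j on the remaining features. The helper `vif_scores` below is a hypothetical name written for this example, and the data are synthetic. A VIF above roughly 5 to 10 is a commonly used rule of thumb for concern, not a hard threshold.

```python
import numpy as np

def vif_scores(X):
    """Variance inflation factor for each column of X.

    VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j
    on all other columns (plus an intercept).
    """
    n, d = X.shape
    scores = np.empty(d)
    for j in range(d):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        scores[j] = 1.0 / (1.0 - r2)
    return scores

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)            # independent of a
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a
X = np.column_stack([a, b, c])

vif = vif_scores(X)
for name, score in zip(["a", "b", "c"], vif):
    print(f"VIF({name}) = {score:.1f}")
```

Here `a` and `c` carry almost the same information, so both receive a large VIF, while the independent feature `b` stays close to 1.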

> **Final Question**: You have three Linear Regression models with validation results: Model A (R²=0.85, RMSE=0.52), Model B (R²=0.82, RMSE=0.48), Model C (R²=0.88, RMSE=0.61). Which model should you choose for deployment and why?
>
> A. Model A because it has the best balance of R² and RMSE
>
> B. Model C because it has the highest R² score
>
> C. They are all equivalent since R² and RMSE measure the same thing
>
> D. Model B because RMSE directly measures prediction error in the target's units, which matters more for real-world impact

<details><summary>Click to reveal answer</summary>

**Correct Answer: D**

**Explanation:**
- **A is FALSE**: While Model A has moderate scores on both metrics, "balance" isn't the goal - minimizing real-world prediction error is. Model B has the lowest RMSE (0.48), meaning its predictions are closest to actual values on average, which is what matters for deployment.
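- **B is FALSE**: R² measures the fraction of variance explained, but doesn't tell you the magnitude of errors in meaningful units. Model C has the highest R² (0.88) but also the highest RMSE (0.61), meaning its errors are actually the LARGEST of the three. Since R² = 1 − MSE/Var(y), the two metrics rank models identically on a single fixed validation set, so a disagreement like this one indicates the models were scored against targets with different variance (for example, different validation splits). A high R² alone therefore does not guarantee small errors.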
- **C is FALSE**: R² and RMSE measure different things. R² is scale-independent (compares model to baseline of predicting mean), while RMSE is in the target's units (e.g., dollars, years). You can have high R² with high RMSE if the target has high variance. They provide complementary information.
- **D is TRUE**: With the target in units of $100,000, RMSE = 0.48 corresponds to a typical (root-mean-square) error of about 0.48 × $100,000 = $48,000. That is lower than Model A's 0.52 ($52,000) and Model C's 0.61 ($61,000). For deployment, you want to minimize actual prediction errors (RMSE), not just maximize variance explained (R²).

**Key Insight**: For model selection, prioritize metrics that reflect real-world impact. RMSE in interpretable units (dollars, days, etc.) is often more actionable than R².

</details>
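
How R² and RMSE relate can be verified directly. The sketch below defines both metrics by hand on synthetic validation targets (the two sets of predictions are invented for illustration). It checks the identity R² = 1 − RMSE² / Var(y) and shows that rescaling the target changes RMSE but leaves R² unchanged.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error, in the units of the target."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
y_val = rng.normal(2.0, 1.0, 200)            # synthetic targets
pred_good = y_val + rng.normal(0, 0.3, 200)  # smaller errors
pred_poor = y_val + rng.normal(0, 0.6, 200)  # larger errors

for name, pred in [("good", pred_good), ("poor", pred_poor)]:
    identity = 1.0 - rmse(y_val, pred) ** 2 / np.var(y_val)
    print(f"{name}: RMSE={rmse(y_val, pred):.3f}  "
          f"R²={r2(y_val, pred):.3f}  1-RMSE²/Var(y)={identity:.3f}")

# Converting the target from units of $100,000 to dollars
scale = 100_000
print(f"RMSE in dollars: {rmse(scale * y_val, scale * pred_good):,.0f}")
print(f"R² in dollars:   {r2(scale * y_val, scale * pred_good):.3f}")
```

On one fixed validation set the model with the lower RMSE always has the higher R², so the two metrics only tell different stories when the targets being compared differ in variance or units.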

## Congratulations!

You've completed the Linear Regression Hands-On Lab! You now understand:

- ✅ The mathematics behind Linear Regression
- ✅ How to implement it from scratch
- ✅ The importance of feature scaling
- ✅ How to properly split and evaluate models
- ✅ Bias-variance tradeoff concepts
- ✅ Residual analysis and diagnostics
- ✅ When to use Linear Regression vs other algorithms

**Next Steps:**
- Explore regularization (Ridge and Lasso regression)
- Learn about polynomial regression for non-linear relationships
- Study logistic regression for classification tasks
- Practice with different datasets to build intuition