# Week 4 Lab: Model Preparation & Evaluation

Welcome to this week's lab on **Model Preparation and Evaluation**. We'll be covering the following key concepts:
- Train/Validation/Test Splits
- Avoiding Data Leakage
- Using Pipelines for cleaner workflows
- Model Evaluation Metrics (like MSE and R²)
- Hyperparameter Tuning with GridSearchCV

---

## Week 4 Challenge Lab

This lab is designed to give you more independence in applying what you've learned. You're encouraged to explore, test hypotheses, and compare models. Use the cells and prompts below to guide your process.

### Objective:
Build and evaluate a regression model using the California Housing dataset. Follow best practices for data preparation and model evaluation.


## Step 1: Load and Explore the Data
Use the California Housing dataset.

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
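
The step title also asks you to explore the data, which the cell above does not do yet. A minimal sketch of one way to do it, assuming you want a pandas DataFrame with named columns: the helper `to_frame` and the small random `demo` table are hypothetical stand-ins so the example runs without downloading anything.

```python
import numpy as np
import pandas as pd


def to_frame(data, target, feature_names, target_name="target"):
    """Combine a feature matrix and a target vector into one labeled DataFrame."""
    df = pd.DataFrame(data, columns=list(feature_names))
    df[target_name] = target
    return df


# Hypothetical stand-in data (NOT the housing data), just to show the calls
rng = np.random.default_rng(0)
demo = to_frame(rng.normal(size=(6, 2)), rng.normal(size=6), ["feat_a", "feat_b"])

# Basic exploration: summary statistics and missing-value counts per column
summary = demo.describe().T[["mean", "std", "min", "max"]]
missing = demo.isna().sum()
print(summary)
print(missing)
```

With the real dataset, the call would be `to_frame(housing.data, housing.target, housing.feature_names, "MedHouseVal")`.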

## Step 2: Train/Test Split
Split the data into 80/20 using `train_test_split`.

**Task 1:**

WARNING! The code below contains data leakage. Remember that data leakage happens when information from the test data reaches the model before the final evaluation. **Find the part of the code where the data leakage happens and correct it. What differences do you find before and after correcting it?**

In [9]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)

print("Test R² (with leakage):", r2_score(y_test, y_pred))
print("Test RMSE (with leakage):", np.sqrt(mean_squared_error(y_test, y_pred)))

Test R² (with leakage): 0.5757877060324508
Test RMSE (with leakage): 0.7455813830127764


Corrected data leakage:

In [18]:
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split first (before scaling)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale properly: fit only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)

print("Test R² (without leakage):", r2_score(y_test, y_pred))
print("Test RMSE (without leakage):", np.sqrt(mean_squared_error(y_test, y_pred)))

Test R² (without leakage): 0.575787706032451
Test RMSE (without leakage): 0.7455813830127762
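
The two runs print the same scores to about 15 decimal places. That is expected for ordinary least squares: `StandardScaler` applies a per-feature affine map, and a linear model with an intercept makes the same predictions under any such rescaling, so slightly different scaling statistics do not change the fit. The split is also identical because `random_state` is the same. The sketch below illustrates this on synthetic data (an assumption, not the housing data) and runs the same comparison with `KNeighborsRegressor`, where the predictions may differ because distances depend on the scaling.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the housing data)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)


def predict_with_scaling(model, leaky):
    """Return test-set predictions using leaky or leakage-free scaling."""
    X_tr, X_te, y_tr, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    if leaky:
        scaler.fit(X_demo)  # statistics computed from train AND test rows
    else:
        scaler.fit(X_tr)    # statistics computed from training rows only
    model.fit(scaler.transform(X_tr), y_tr)
    return model.predict(scaler.transform(X_te))


# Largest absolute difference between leaky and clean predictions
ols_gap = np.max(np.abs(predict_with_scaling(LinearRegression(), True)
                        - predict_with_scaling(LinearRegression(), False)))
knn_gap = np.max(np.abs(predict_with_scaling(KNeighborsRegressor(), True)
                        - predict_with_scaling(KNeighborsRegressor(), False)))

print("Max prediction gap, LinearRegression:", ols_gap)
print("Max prediction gap, KNeighborsRegressor:", knn_gap)
```

Identical scores here do not make the leaky version acceptable: the habit of fitting preprocessing on the training split only is what protects you once the model or the preprocessing step is sensitive to it.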


Now split the data 60/20/20 into train, validation, and test sets:

In [6]:
# --- Train/Validation/Test Split ---

# Load dataset
housing = fetch_california_housing() #Reload for clean "unseen" data
X, y = housing.data, housing.target

# Step 1: Split into TRAIN_FULL + TEST (80/20)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Split TRAIN into TRAIN + VALIDATION (75/25 of original train)
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=42)
# Now: 60% train, 20% val, 20% test

print("Shapes -> Train:", X_train.shape, "Validation:", X_val.shape, "Test:", X_test.shape)

Shapes -> Train: (12384, 8) Validation: (4128, 8) Test: (4128, 8)
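
The two-stage split above can be wrapped in a small helper so the fractions are stated once. This is a sketch, not part of the lab template: the function name `train_val_test_split` and its parameters are hypothetical, and it passes absolute row counts to `train_test_split` to avoid working out the second fraction by hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, random_state=42):
    """Split into train/validation/test parts using absolute row counts."""
    n = len(X)
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    # First carve off the test set, then carve the validation set from the rest
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, random_state=random_state)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, random_state=random_state)
    return X_train, X_val, X_test, y_train, y_val, y_test


# Toy data where each target value identifies its row
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000)

parts = train_val_test_split(X_demo, y_demo)
sizes = [len(p) for p in parts[:3]]
print("Train/Val/Test sizes:", sizes)
```

For the 20,640 housing rows the same fractions give 12,384 / 4,128 / 4,128 rows, matching the shapes printed above.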


## Step 3: Choose and Create a Pipeline
Try out StandardScaler + Ridge, Lasso, or KNeighborsRegressor. You can also add your own steps.

In [19]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Build and fit the pipeline: standardize the features, then fit Ridge regression.
# The scaler is part of the pipeline, so it is fit on the training data only.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0, random_state=42))
])
pipeline.fit(X_train, y_train)

# Evaluate on the validation set
y_pred_ridge = pipeline.predict(X_val)
print("Ridge R²:", r2_score(y_val, y_pred_ridge))
print("Ridge RMSE:", np.sqrt(mean_squared_error(y_val, y_pred_ridge)))

Ridge R²: 0.6155941264515434
Ridge RMSE: 0.726521795865777
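
The step text suggests trying Ridge, Lasso, or KNeighborsRegressor. One way to compare them is to loop over candidate estimators and wrap each in the same scaling pipeline. The sketch below uses synthetic data and arbitrary hyperparameters (both assumptions), so its numbers say nothing about the housing results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the train/validation split (assumption)
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Candidate models; hyperparameter values are illustrative only
candidates = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01, max_iter=10000),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

results = {}
for name, estimator in candidates.items():
    # Same preprocessing for every candidate; the scaler is fit on X_tr only
    pipe = Pipeline([("scaler", StandardScaler()), ("model", estimator)])
    pipe.fit(X_tr, y_tr)
    pred = pipe.predict(X_va)
    results[name] = {
        "r2": r2_score(y_va, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_va, pred))),
    }

for name, scores in results.items():
    print(f"{name}: R²={scores['r2']:.3f}  RMSE={scores['rmse']:.3f}")
```

In the lab, replace `X_tr, X_va, y_tr, y_va` with `X_train, X_val, y_train, y_val` from Step 2.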




## Step 4: Evaluate Your Model
Use MSE, R², and any other metric of your choice to evaluate the model on the validation set.

In [13]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = pipeline.predict(X_val)
print("Validation MSE:", mean_squared_error(y_val, y_pred))
print("Validation R²:", r2_score(y_val, y_pred))

Validation MSE: 0.5278339198680337
Validation R²: 0.6155941264515434
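
To see how these metrics relate, it can help to compute them directly with NumPy. The sketch below adds MAE as one possible "other metric"; the helper name `regression_report` is hypothetical, and the four-point example is a toy input rather than lab data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R² from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))                  # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    return {
        "mse": mse,
        "rmse": mse ** 0.5,                                 # same units as the target
        "mae": float(np.mean(np.abs(residuals))),
        "r2": 1.0 - ss_res / ss_tot,
    }


y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]
report = regression_report(y_true_demo, y_pred_demo)
print(report)

# Cross-check against scikit-learn's implementations
print(mean_squared_error(y_true_demo, y_pred_demo),
      mean_absolute_error(y_true_demo, y_pred_demo),
      r2_score(y_true_demo, y_pred_demo))
```

Calling `regression_report(y_val, pipeline.predict(X_val))` should reproduce the MSE and R² printed above, and RMSE is simply the square root of that MSE.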


## Step 5: Cross-Validation

In [14]:
# Step 3: Scale features (fit on TRAIN only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Step 4: Define model
model1 = LinearRegression()

# --- Option 1: Evaluate on validation set (simple) ---
model1.fit(X_train_scaled, y_train)
y_val_pred = model1.predict(X_val_scaled)
print("Validation R²:", r2_score(y_val, y_val_pred))
print("Validation RMSE:", np.sqrt(mean_squared_error(y_val, y_val_pred)))

Validation R²: 0.6155900266974583
Validation RMSE: 0.7265256700947677


In [15]:
# --- Option 2: K-Fold Cross-Validation on TRAIN only ---
from sklearn.model_selection import KFold, cross_val_score

# Define model
model2 = LinearRegression()

# Do K-Fold validation with k=5
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model2, X_train_scaled, y_train, cv=kf, scoring='r2')
print("\nK-Fold CV scores on training set:", cv_scores)
print("Mean CV R²:", np.mean(cv_scores))
print("Std CV R²:", np.std(cv_scores))


K-Fold CV scores on training set: [0.60970239 0.60411343 0.6354319  0.60076148 0.60727452]
Mean CV R²: 0.611456745810643
Std CV R²: 0.012358719496266605
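
In the cell above the scaler was fit on all of `X_train` before cross-validation, so each fold's held-out rows contributed to the scaling statistics. For plain linear regression this should not change the scores, since the fit is invariant to feature rescaling, but for regularized or distance-based models it can. A sketch of the stricter variant, with the scaler inside a `Pipeline` so it is refit on each fold's training part; the synthetic data is an assumption:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (unscaled) training set
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

# Scaling happens inside each fold because it is a pipeline step
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("linreg", LinearRegression()),
])

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=kf, scoring="r2")

print("Fold R² scores:", scores)
print("Mean:", np.mean(scores), "Std:", np.std(scores))
```

In the lab you would pass the unscaled `X_train` and `y_train` instead of the synthetic arrays.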


## Step 5 (continued): Hyperparameter Tuning with GridSearchCV

In [21]:
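# --- Hyperparameter Tuning (Ridge & Lasso) ---

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV

# Scale full dataset for this part.
# Caution: this fits the scaler and the grid search on all of X, test rows included,
# so the scores below are not an unbiased estimate of held-out performance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)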

# Ridge Regression Grid Search
ridge = Ridge()
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}
ridge_search = GridSearchCV(ridge, param_grid, cv=5, scoring='r2')
ridge_search.fit(X_scaled, y)

# Lasso Regression Grid Search
lasso = Lasso(max_iter=10000)
lasso_search = GridSearchCV(lasso, param_grid, cv=5, scoring='r2')
lasso_search.fit(X_scaled, y)

print("Best Ridge alpha:", ridge_search.best_params_, "Mean R²:", ridge_search.best_score_)
print("Best Lasso alpha:", lasso_search.best_params_, "Mean R²:", lasso_search.best_score_)

Best Ridge alpha: {'alpha': 10} Mean R²: 0.5530925208131359
Best Lasso alpha: {'alpha': 0.01} Mean R²: 0.5497644182002727
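
The search above runs on the full, pre-scaled dataset, which repeats the leakage pattern from Task 1. A leakage-free variant puts the scaler in a pipeline and tunes on the training split only. The sketch below assumes synthetic data and a shuffled `KFold`; the cell above passes `cv=5`, which for a regressor means unshuffled folds. Note the `ridge__alpha` key, which addresses the `alpha` parameter of the pipeline step named `ridge`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the housing data)
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# Step name + double underscore + parameter name
alphas = [0.01, 0.1, 1, 10, 100]
param_grid = {"ridge__alpha": alphas}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="r2")
search.fit(X_tr, y_tr)  # the test split is never seen during tuning

best_alpha = search.best_params_["ridge__alpha"]
print("Best alpha:", best_alpha, "Mean CV R²:", search.best_score_)

# The refit best pipeline can then be scored once on the held-out test split
test_r2 = search.score(X_te, y_te)
print("Held-out R²:", test_r2)
```

The same pattern works for Lasso by swapping the final step and using a `lasso__alpha` key.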


## Step 6: Final Evaluation

**Task 2:**

Evaluate the models on the test set.

In [None]:
#Example:
#y_test_pred = model.predict(X_test_scaled)
#print("\nTest R² (final evaluation for model):", r2_score(y_test, y_test_pred))
#print("Test RMSE for model:", np.sqrt(mean_squared_error(y_test, y_test_pred)))
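
One common option for the final evaluation, not prescribed by the lab, is to refit the chosen pipeline on train plus validation data and then score it exactly once on the test set. The sketch below assumes synthetic data and a Ridge pipeline with `alpha=1.0` as the chosen model; both are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 60/20/20 split (assumption)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)
X_rest, X_te, y_rest, y_te = train_test_split(X_demo, y_demo, test_size=100, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=100, random_state=42)

# Model selection is finished, so train and validation rows can be merged
X_trval = np.concatenate([X_tr, X_va])
y_trval = np.concatenate([y_tr, y_va])

final_model = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),  # placeholder for whichever model you selected
])
final_model.fit(X_trval, y_trval)

# Single, final look at the test set
y_test_pred = final_model.predict(X_te)
test_r2 = r2_score(y_te, y_test_pred)
test_rmse = float(np.sqrt(mean_squared_error(y_te, y_test_pred)))
print("Test R²:", test_r2)
print("Test RMSE:", test_rmse)
```

If you go back and change the model after seeing these numbers, the test set is no longer an unbiased check.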

## Step 7: Reflection
- What worked well?
- What didn't?
- Try another model and compare!