# Case Study: KNN Regression with California Housing Dataset

K‑Nearest Neighbors (KNN) is a simple yet powerful machine learning algorithm that can be applied to both classification and regression tasks. In **regression**, instead of voting for a class label, KNN predicts a continuous value by **averaging** the target values of the K nearest neighbors. In this case study, we'll step through a practical example using the **California Housing** dataset to illustrate key concepts and best practices of KNN regression. This dataset contains information about housing blocks in California from the 1990 census, with 8 features (e.g., median income, house age, average rooms, location) and a target variable representing the median house value.

## What we'll cover
- **Data exploration and preparation:** Understanding feature distributions and splitting data into training, validation, and test sets.  
- **Impact of feature scaling:** Demonstrating how scaling features affects KNN performance in regression tasks.  
- **Choosing the number of neighbors (K):** Tuning K to balance model complexity (bias vs. variance).  
- **Distance metric considerations:** How the choice of distance measure can affect KNN predictions.  
- **Model evaluation:** Evaluating the final model using regression metrics (RMSE, MAE, R², MAPE) on a test set to check that it generalizes well to unseen data.

## Exploring the Dataset
Before diving into modeling, let's load the dataset and examine its features. The California Housing dataset has 20,640 samples, each with 8 features. The target is `MedHouseVal` (median house value in $100,000s).

**Features:**  
- MedInc: median income in block group  
- HouseAge: median house age in block group  
- AveRooms: average number of rooms per household  
- AveBedrms: average number of bedrooms per household  
- Population: block group population  
- AveOccup: average number of household members  
- Latitude: block group latitude  
- Longitude: block group longitude  

Large differences in magnitude (e.g., *Population* in thousands vs *AveBedrms* around 1) motivate **scaling** before using distance-based models.

In [None]:
# Imports and data loading
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    mean_absolute_percentage_error
)

# Load the California housing dataset
data = fetch_california_housing()
X = data.data
y = data.target
feature_names = data.feature_names

# Create DataFrame for exploration
df = pd.DataFrame(X, columns=feature_names)
df['MedHouseVal'] = y
df.head()

Let's examine the target variable distribution and basic feature statistics, then split the data into training, validation, and test sets using a 60/20/20 split.

> **Question**: You've used the validation set to tune K from 1 to 30, ultimately selecting K=7 with validation RMSE of \$0.52. Why is it critical to evaluate on a separate test set before deploying the model?
>  
> A) The validation set was used for hyperparameter selection, which can lead to optimistic performance estimates.
>
> B) The test set provides additional opportunities to fine-tune hyperparameters for better accuracy.
>
> C) Validation RMSE is systematically biased upward and always overestimates real-world error.
>
> D) The test set helps identify which features should be added or removed from the model.

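Holding out a test set is standard practice: because it plays no role in training or hyperparameter selection, it gives an unbiased estimate of performance and reveals any overfitting to the validation set.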

In [None]:
# Examine target distribution
print("Target variable (MedHouseVal) statistics:")
print(df['MedHouseVal'].describe(), "\n")

# Descriptive statistics of features
display(df[feature_names].describe().T[['mean', 'std', 'min', 'max']])

# Plot target distribution
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(y, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Median House Value ($100k)')
plt.ylabel('Frequency')
plt.title('Distribution of Target Variable')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.boxplot(y)
plt.ylabel('Median House Value ($100k)')
plt.title('Box Plot of Target Variable')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Split into train, validation, and test sets (60/20/20)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
print("Train size:", X_train.shape[0], "Validation size:", X_val.shape[0], "Test size:", X_test.shape[0])

## Effect of Feature Scaling on KNN Regression

Just like in classification, KNN regression uses distance to find nearest neighbors; if features are on very different scales, distance calculations will be dominated by the feature with the largest range. The example below illustrates how a difference in *Population* (thousands) can swamp a difference in *AveBedrms* (around 1). Therefore, scaling features to comparable ranges is critical for KNN.

In [None]:
# Demonstrate distance dominance (hypothetical differences)
from math import sqrt

delta_population_large = 1000.0
delta_bedrooms_small = 0.5

d1 = sqrt(delta_population_large**2 + 0.0**2)
d2 = sqrt(0.0**2 + delta_bedrooms_small**2)

print("Distance if only Population differs by +1000:", round(d1, 3))
print("Distance if only AveBedrms differs by +0.5  :", round(d2, 3))
print("Ratio (Population / Bedrooms):", round(d1 / d2, 1))

Next, we train a baseline KNN regression model with `K=5` using **unscaled** features and **scaled** features to compare validation performance. We'll use RMSE (Root Mean Squared Error) as our primary metric. Note that we scale using parameters learned from the training set only to avoid leakage.

In [None]:
# Baseline without scaling
knn_raw = KNeighborsRegressor(n_neighbors=5)
knn_raw.fit(X_train, y_train)
raw_val_pred = knn_raw.predict(X_val)
raw_val_rmse = np.sqrt(mean_squared_error(y_val, raw_val_pred))

# Baseline with scaling (fit on train only)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled   = scaler.transform(X_val)

knn_scaled = KNeighborsRegressor(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
scaled_val_pred = knn_scaled.predict(X_val_scaled)
scaled_val_rmse = np.sqrt(mean_squared_error(y_val, scaled_val_pred))

print(f"Validation RMSE without scaling: ${raw_val_rmse:.3f} (×100k)")
print(f"Validation RMSE with scaling:    ${scaled_val_rmse:.3f} (×100k)")
print(f"Improvement: {((raw_val_rmse - scaled_val_rmse) / raw_val_rmse * 100):.1f}%")

The scaled model typically performs significantly better because each feature contributes fairly to distance computation.
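
The standardization step can also be sketched by hand. The snippet below is a minimal illustration on made-up numbers (the arrays `toy_train` and `toy_val` are hypothetical and not taken from the housing data). It computes the mean and standard deviation on the training rows only and reuses them for the validation row, which is the same idea as fitting `StandardScaler` on the training set.

```python
import numpy as np

# Hypothetical toy data: column 0 mimics Population, column 1 mimics AveBedrms
toy_train = np.array([[1000.0, 1.0],
                      [3000.0, 1.2],
                      [2000.0, 0.8]])
toy_val = np.array([[2500.0, 1.1]])

# Learn scaling parameters from the training rows only
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same parameters to both splits
toy_train_scaled = (toy_train - mu) / sigma
toy_val_scaled = (toy_val - mu) / sigma

print("Train means after scaling:", toy_train_scaled.mean(axis=0).round(6))
print("Train stds after scaling: ", toy_train_scaled.std(axis=0).round(6))
print("Scaled validation row:    ", toy_val_scaled.round(3))
```

After scaling, both toy columns have mean 0 and standard deviation 1 on the training rows, so neither one dominates a Euclidean distance purely because of its units.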

In [None]:
# Demonstration: step-by-step prediction for a single query point (taken from the validation set)
from sklearn.metrics import pairwise_distances

# Pick the first validation example for demonstration
x_test_example = X_val_scaled[0:1]  # Shape: (1, n_features)
y_test_actual = y_val[0]

print("="*60)
print("STEP-BY-STEP PREDICTION WALKTHROUGH")
print("="*60)
print(f"\nActual target value: ${y_test_actual:.2f} (×100k) = ${y_test_actual*100:.0f}k\n")

# Step 1: Calculate distances from test point to all training points
distances = pairwise_distances(x_test_example, X_train_scaled, metric='euclidean').ravel()
print(f"Step 1: Calculated {len(distances)} distances from test point to training points")
print(f"        Distance range: [{distances.min():.3f}, {distances.max():.3f}]")

# Step 2: Find indices of K nearest neighbors
K = 5
k_nearest_indices = np.argsort(distances)[:K]
k_nearest_distances = distances[k_nearest_indices]
print(f"\nStep 2: Found K={K} nearest neighbors")
print(f"        Indices: {k_nearest_indices}")
print(f"        Distances: {[f'{d:.3f}' for d in k_nearest_distances]}")

# Step 3: Get target values of K nearest neighbors
k_nearest_targets = y_train[k_nearest_indices]
print(f"\nStep 3: Retrieved target values of K nearest neighbors")
print(f"        Neighbor targets: {[f'{t:.2f}' for t in k_nearest_targets]}")

# Step 4: Average the targets (THIS IS THE PREDICTION!)
prediction = np.mean(k_nearest_targets)
print(f"\nStep 4: AVERAGE the neighbor targets")
print(f"        Prediction = mean({[f'{t:.2f}' for t in k_nearest_targets]})")
print(f"        Prediction = {prediction:.2f} (×100k) = ${prediction*100:.0f}k")

print(f"\n" + "="*60)
print(f"RESULT:")
print(f"  Actual:     ${y_test_actual:.2f} (×100k)")
print(f"  Predicted:  ${prediction:.2f} (×100k)")
print(f"  Error:      ${abs(y_test_actual - prediction):.2f} (×100k)")
print("="*60)

# Verify this matches sklearn's prediction (assert first, then report success)
sklearn_prediction = knn_scaled.predict(x_test_example)[0]
assert np.isclose(prediction, sklearn_prediction), "Predictions should match!"
print(f"\nVerification: sklearn prediction = ${sklearn_prediction:.2f} ✓")

## Step-by-Step: How KNN Makes a Prediction

To understand exactly how KNN regression works, the code cell above walks through the prediction process step-by-step for a single point from the validation set, using the same approach shown in the lecture slides.

**Steps:**
1. Calculate pairwise distances from the test point to all training points
2. Sort distances and select the K nearest neighbors (smallest distances)
3. Get the target values of these K neighbors
4. **Average** these values to make the prediction

> **Key Difference from Classification**: In classification, KNN uses **voting** (most common class). In regression, KNN uses **averaging** (mean of continuous values).
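
The difference between voting and averaging can be seen on a single set of neighbors. The sketch below uses invented class labels and target values for five hypothetical neighbors (`neighbor_classes` and `neighbor_values` are illustrative names, not part of the notebook).

```python
import numpy as np
from collections import Counter

# Hypothetical values attached to the K=5 nearest neighbors of one query point
neighbor_classes = ["cheap", "expensive", "cheap", "cheap", "expensive"]
neighbor_values = np.array([1.2, 3.4, 1.5, 1.1, 2.8])  # in $100k

# Classification: the most common class among the neighbors wins
predicted_class = Counter(neighbor_classes).most_common(1)[0][0]

# Regression: the prediction is the mean of the neighbors' target values
predicted_value = neighbor_values.mean()

print("Classification (vote):", predicted_class)
print("Regression (average): ", round(float(predicted_value), 2))
```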

**Visualization: Actual vs Predicted values**  
Let's visualize how well our scaled KNN model predicts on the validation set.

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(y_val, scaled_val_pred, alpha=0.5, edgecolor='k', s=20)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 'r--', lw=2, label='Perfect prediction')
plt.xlabel('Actual House Value ($100k)')
plt.ylabel('Predicted House Value ($100k)')
plt.title('Actual vs Predicted (K=5, Scaled Features)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Distance Metric Considerations
Choosing a distance metric is itself a hyperparameter. For continuous features, **Euclidean (L2)** is the default and measures straight-line distance; **Manhattan (L1)** sums absolute differences and can be more robust to outliers. In practice, treat the metric as something to tune by validation.
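
As a small illustration of why the metric matters, the sketch below compares the two distances on invented, already-scaled feature vectors (`query`, `spread_out`, and `one_big_gap` are hypothetical names chosen for this example).

```python
import numpy as np

def euclidean(a, b):
    """L2 distance: square root of the summed squared differences."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    """L1 distance: sum of the absolute differences."""
    return float(np.sum(np.abs(a - b)))

# Hypothetical scaled feature vectors
query = np.array([0.0, 0.0])
spread_out = np.array([3.0, 3.0])   # moderate difference in both features
one_big_gap = np.array([0.0, 5.0])  # a single large difference in one feature

for name, point in [("spread_out", spread_out), ("one_big_gap", one_big_gap)]:
    print(name, "L2 =", round(euclidean(query, point), 3),
          "L1 =", manhattan(query, point))
```

With these numbers, Euclidean distance ranks `spread_out` as the closer point while Manhattan distance ranks `one_big_gap` as closer, so the two metrics can select different neighbors for the same query.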

> **Question**: Your KNN model uses 3 features: 'MedInc' (range 0-15), 'Population' (range 0-35,000), and 'Latitude' (range 32-42). Without scaling, which feature will dominate the distance calculations, and why?
>
> A) MedInc—it has the strongest correlation with house prices
>
> B) Population—it has the largest numerical range
>
> C) Latitude—geographic features always have higher weight in distance metrics
>
> D) All features contribute equally because KNN normalizes distances automatically

To see the effect of different metrics and K values, we can do a small grid of {Euclidean, Manhattan} × {K = 3, 4, ..., 19} on the validation set. This isn't exhaustive but shows that **distance metric** is a tunable choice.

In [None]:
from itertools import product

metrics = ['euclidean', 'manhattan']
k_values = list(range(3, 20))  # K = 3, 4, ..., 19
rows = []

for metric, k in product(metrics, k_values):
    mdl = KNeighborsRegressor(n_neighbors=k, metric=metric).fit(X_train_scaled, y_train)
    pred = mdl.predict(X_val_scaled)
    rmse = np.sqrt(mean_squared_error(y_val, pred))
    rows.append((metric, k, rmse))

grid_df = pd.DataFrame(rows, columns=['metric', 'k', 'val_rmse']) \
          .pivot(index='metric', columns='k', values='val_rmse')
print("Validation RMSE Grid (lower is better):")
display(grid_df.round(3))

In [None]:
# Pick best (metric, k) by validation RMSE (lower is better)
grid_long = (
    grid_df.stack()                 # -> Series with MultiIndex (metric, k)
           .rename('val_rmse')
           .reset_index()           # -> columns: ['metric', 'k', 'val_rmse']
)

# Tie-breaker: prefer smaller k, then 'euclidean' over 'manhattan'
grid_long['tie_metric_rank'] = grid_long['metric'].map({'euclidean': 0, 'manhattan': 1})

best_row = (
    grid_long.sort_values(
        ['val_rmse', 'k', 'tie_metric_rank'],
        ascending=[True, True, True]  # Lower RMSE is better
    )
    .iloc[0]
)

chosen_metric = best_row['metric']
chosen_k      = int(best_row['k'])
best_val_rmse = float(best_row['val_rmse'])

print("Decision log — chosen params (metric & k):")
print({"metric": chosen_metric, "n_neighbors": chosen_k, "val_rmse": round(best_val_rmse, 3)})

# Sanity check with chosen params on validation data
_knn = KNeighborsRegressor(n_neighbors=chosen_k, metric=chosen_metric).fit(X_train_scaled, y_train)
val_rmse_check = np.sqrt(mean_squared_error(y_val, _knn.predict(X_val_scaled)))
print(f"Validation RMSE (chosen metric & k): {val_rmse_check:.3f}")

## Choosing K: Bias–Variance Trade‑off
A small K (e.g., K=1) is highly flexible and fits training data very closely—**high variance** and potential overfitting. A very large K (approaching the size of the training set) averages over many neighbors—**high bias** and potential underfitting. We sweep K from 1 to 30 and plot training vs. validation RMSE to pick the best K by validation performance.
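
The two extremes described above can be checked on a toy problem. The sketch below is a minimal illustration with a hand-written 1-D KNN regressor and synthetic data (the helper `knn_predict` and the arrays `X_toy` and `y_toy` are assumptions made for this example, not part of the housing workflow).

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Plain KNN regression for 1-D inputs: average the k closest targets."""
    preds = []
    for x in X_query:
        dist = np.abs(X_train - x)
        nearest = np.argsort(dist)[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical noisy 1-D data with distinct x values
rng = np.random.default_rng(0)
X_toy = np.linspace(0.0, 10.0, 40)
y_toy = np.sin(X_toy) + rng.normal(scale=0.3, size=X_toy.size)

# K=1: every training point is its own nearest neighbor, so training error is zero
train_rmse_k1 = rmse(y_toy, knn_predict(X_toy, y_toy, X_toy, k=1))

# K=n: every prediction collapses to the global mean of the targets
pred_kn = knn_predict(X_toy, y_toy, X_toy, k=len(X_toy))

print("Train RMSE with K=1:", train_rmse_k1)
print("K=n predictions all equal the mean:", np.allclose(pred_kn, y_toy.mean()))
```

A training error of zero at K=1 says nothing about generalization, which is why the choice of K is made on validation error rather than training error.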

> **Question**: After training KNN models with different K values, you observe that K=1 achieves training RMSE of \$0.05 but validation RMSE of \$0.75, while K=5 achieves training RMSE of \$0.45 and validation RMSE of \$0.52. What does this pattern suggest about the K=1 model?
>  
> A) The model has high variance and is overfitting to training noise.
>
> B) The model has high bias and requires more complex features.
>
> C) The training dataset is too small and more samples would fix the issue.
>
> D) Add more features to improve generalization.

In [None]:
train_rmse, val_rmse = [], []
k_sweep = range(1, 31)

for k in k_sweep:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train_scaled, y_train)
    train_pred = model.predict(X_train_scaled)
    val_pred = model.predict(X_val_scaled)
    train_rmse.append(np.sqrt(mean_squared_error(y_train, train_pred)))
    val_rmse.append(np.sqrt(mean_squared_error(y_val, val_pred)))

# Best K by validation RMSE
best_k_idx = int(np.argmin(val_rmse))
best_k = best_k_idx + 1
best_val = min(val_rmse)

print("Best K (by validation RMSE):", best_k, "Validation RMSE:", round(best_val, 3))

# Plot train vs validation RMSE vs K
plt.figure(figsize=(10, 6))
plt.plot(list(k_sweep), train_rmse, marker='o', label='Train RMSE', linewidth=2)
plt.plot(list(k_sweep), val_rmse, marker='s', label='Validation RMSE', linewidth=2)
plt.axvline(best_k, linestyle='--', color='red', label=f'Best K={best_k}')
plt.xlabel('K (Number of Neighbors)', fontsize=12)
plt.ylabel('RMSE ($100k)', fontsize=12)
plt.title('Bias-Variance Trade-off: RMSE vs K', fontsize=14)
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

## Model Evaluation on Test Set
With the hyperparameters selected by the grid search above (`chosen_metric` and `chosen_k`), we refit KNN (within a pipeline to avoid leakage) on the combined training + validation data and evaluate performance on the **held-out test** set. Because the test set played no part in training or tuning, this provides an unbiased estimate of real-world performance. We'll use multiple regression metrics:

- **RMSE (Root Mean Squared Error)**: Penalizes large errors more heavily
- **MAE (Mean Absolute Error)**: Average absolute prediction error
- **R² Score**: Proportion of variance explained (1.0 is perfect, 0.0 matches always predicting the mean, and negative values are worse than that baseline)
- **MAPE (Mean Absolute Percentage Error)**: Percentage error
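
To make the definitions above concrete, the four metrics can be computed by hand. This is a minimal sketch on invented numbers (`y_true_toy` and `y_pred_toy` are hypothetical); scikit-learn's metric functions should agree with it on the same inputs.

```python
import numpy as np

# Hypothetical actual and predicted values (in $100k)
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.0, 2.5, 4.0])

errors = y_true_toy - y_pred_toy

# RMSE squares the errors before averaging, so large misses weigh more
rmse_toy = np.sqrt(np.mean(errors ** 2))

# MAE averages the absolute errors
mae_toy = np.mean(np.abs(errors))

# R² compares the residual error with the error of always predicting the mean
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1.0 - ss_res / ss_tot

# MAPE is a fraction here; multiply by 100 for a percentage
mape_toy = np.mean(np.abs(errors) / np.abs(y_true_toy))

print(f"RMSE: {rmse_toy:.3f}  MAE: {mae_toy:.3f}  "
      f"R²: {r2_toy:.3f}  MAPE: {mape_toy * 100:.1f}%")
```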

> **Question**: Your final KNN model achieves validation RMSE of \$0.52 and test RMSE of \$0.56. Before deployment, what's the most appropriate interpretation?
>  
> A) The difference is normal variation; verify test performance meets business requirements.
>
> B) The test set has data leakage and should be regenerated with better random seeds.
>
> C) Re-tune hyperparameters using the test set to minimize the performance gap.
>
> D) The model is underfitting and needs a smaller K value for better flexibility.

In [None]:
# Combine training and validation sets for final training
X_train_all = np.vstack([X_train, X_val])
y_train_all = np.hstack([y_train, y_val])

# Build pipeline (scaler + KNN) with chosen hyperparameters
final_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsRegressor(n_neighbors=chosen_k, metric=chosen_metric))
])

final_pipe.fit(X_train_all, y_train_all)

# Predict on test set
test_pred = final_pipe.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))
test_mae = mean_absolute_error(y_test, test_pred)
test_r2 = r2_score(y_test, test_pred)
test_mape = mean_absolute_percentage_error(y_test, test_pred)

print("="*50)
print("FINAL TEST SET PERFORMANCE")
print("="*50)
print(f"RMSE:  ${test_rmse:.3f} (×100k) = ${test_rmse*100:.0f}k")
print(f"MAE:   ${test_mae:.3f} (×100k) = ${test_mae*100:.0f}k")
print(f"R²:    {test_r2:.3f}")
print(f"MAPE:  {test_mape*100:.1f}%")
print("="*50)

Let's visualize the test set predictions:

In [None]:
# Plot actual vs predicted for test set
plt.figure(figsize=(12, 5))

# Scatter plot
plt.subplot(1, 2, 1)
plt.scatter(y_test, test_pred, alpha=0.5, edgecolor='k', s=20)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2, label='Perfect prediction')
plt.xlabel('Actual House Value ($100k)')
plt.ylabel('Predicted House Value ($100k)')
plt.title(f'Test Set: Actual vs Predicted\n(R² = {test_r2:.3f}, RMSE = ${test_rmse:.3f})')
plt.legend()
plt.grid(True, alpha=0.3)

# Residual plot
plt.subplot(1, 2, 2)
residuals = y_test - test_pred
plt.scatter(test_pred, residuals, alpha=0.5, edgecolor='k', s=20)
plt.axhline(0, color='r', linestyle='--', lw=2)
plt.xlabel('Predicted House Value ($100k)')
plt.ylabel('Residuals ($100k)')
plt.title('Residual Plot')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Distribution of residuals
plt.figure(figsize=(10, 4))
plt.hist(residuals, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Residuals ($100k)')
plt.ylabel('Frequency')
plt.title('Distribution of Prediction Errors')
plt.axvline(0, color='r', linestyle='--', lw=2, label='Zero error')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Where Do Errors Come From?

Understanding where KNN regression makes larger errors helps us interpret model performance and identify areas for improvement. As shown in the lecture slides, prediction quality differs between two types of regions:

### High Certainty Regions
- **Characteristics**: Neighbors have similar target values with low variation
- **Prediction Quality**: Low residual errors (high confidence predictions)
- **Why**: When K nearest neighbors have similar values, their average is a reliable estimate
- **Example**: In dense neighborhoods where houses have similar prices

### High Ambiguity Regions  
- **Characteristics**: Neighbors have high variation in target values
- **Prediction Quality**: High residual errors (uncertain predictions)
- **Why**: When K nearest neighbors have very different values, averaging produces less reliable estimates
- **Example**: Boundary regions between expensive and affordable neighborhoods, or sparse data regions

### Additional Error Sources
- **Sparse Regions**: Areas with few training points lead to unreliable neighbor selection
- **Boundary Regions**: Transition zones where the K nearest neighbors come from different underlying patterns, so their average may fit neither side well

> **Key Insight**: KNN performs best in regions where neighbors have consistent target values. The model struggles in regions with high local variability or sparse data, as the averaging assumption breaks down.
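
One way to make the certainty/ambiguity distinction measurable is to look at the spread of the neighbors' targets alongside their mean. The sketch below is an illustration under that assumption, using a hypothetical helper `knn_mean_and_spread` and invented 1-feature data; it is not a method taken from the notebook.

```python
import numpy as np

def knn_mean_and_spread(X_train, y_train, x_query, k):
    """Return the KNN prediction and the std of the k neighbors' targets."""
    dist = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    nearest = np.argsort(dist)[:k]
    targets = y_train[nearest]
    return float(targets.mean()), float(targets.std())

# Hypothetical data: a homogeneous cluster near 0 and a mixed cluster near 10
X_toy = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
y_toy = np.array([2.0, 2.1, 1.9, 1.0, 5.0, 3.0])

for x in (0.1, 10.1):
    pred, spread = knn_mean_and_spread(X_toy, y_toy, np.array([x]), k=3)
    print(f"query={x}: prediction={pred:.2f}, neighbor std={spread:.2f}")
```

A small neighbor standard deviation corresponds to a high certainty region, while a large one flags a prediction that averages over very different values.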

## Feature Importance via Permutation
Unlike tree-based models, KNN doesn't have built-in feature importance. However, we can use **permutation importance** to understand which features matter most. This technique randomly shuffles each feature and measures the drop in performance.
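
The mechanics of the technique can be sketched without any library support. The function below is a simplified, hypothetical implementation (`permutation_importance_manual`, `toy_predict`, and the toy arrays are names invented for this example), measuring the average increase in RMSE after shuffling one column at a time.

```python
import numpy as np

def permutation_importance_manual(predict, X, y, n_repeats=5, seed=0):
    """Average increase in RMSE after shuffling each column of X in turn."""
    rng = np.random.default_rng(seed)

    def rmse(y_true, y_pred):
        return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

    baseline = rmse(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between this feature and the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(rmse(y, predict(X_perm)) - baseline)
        importances[j] = np.mean(increases)
    return importances

# Hypothetical data: the target depends on column 0 only; column 1 is noise
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0]

def toy_predict(X):
    """Stand-in model that already knows the true relationship."""
    return 3.0 * X[:, 0]

toy_importances = permutation_importance_manual(toy_predict, X_toy, y_toy)
print(toy_importances.round(3))
```

Shuffling the informative column raises the error, while shuffling the unused column leaves it unchanged, which is the signal permutation importance relies on.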

In [None]:
from sklearn.inspection import permutation_importance

# Compute permutation importance on test set
perm_importance = permutation_importance(
    final_pipe, X_test, y_test,
    n_repeats=10, random_state=42,
    scoring='neg_root_mean_squared_error'
)

# Create DataFrame for visualization
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': perm_importance.importances_mean,
    'std': perm_importance.importances_std
}).sort_values('importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance'], xerr=importance_df['std'])
plt.xlabel('Increase in RMSE when shuffled (Importance)')
plt.ylabel('Feature')
plt.title('Permutation Feature Importance')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("\nFeature Importance Ranking:")
display(importance_df)

## Limitations (Current Scope) & What's Next
This notebook uses a **single hold‑out validation** set, which is simple but sensitive to data splits. In practice, data scientists often use **k‑fold cross‑validation** or nested validation to obtain more reliable estimates and avoid overfitting hyperparameters to a single split. We also left the neighbor search at scikit-learn's default (`algorithm='auto'`) and didn't explore how search structures such as KD‑trees and Ball Trees, or approximate nearest neighbor tools (e.g., FAISS, HNSW), affect scalability. These become important when your dataset grows to millions of rows or requires low‑latency predictions.
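
As a rough sketch of what k‑fold tuning could look like, the snippet below runs `GridSearchCV` over a scaler + KNN pipeline. It uses synthetic data from `make_regression` as a stand-in (an assumption made so the example is self-contained), and the grid values are illustrative rather than recommended.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in the real training split)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)

# Scaling lives inside the pipeline so it is refit on each training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# Illustrative grid over K and the distance metric
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print("Cross-validated RMSE:", round(-search.best_score_, 3))
```

Because each candidate is scored on five different validation folds, the selected hyperparameters depend less on the luck of a single split.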

**Additional considerations for regression:**
- **Weighting neighbors by distance**: Closer neighbors can have more influence (weights='distance' in sklearn)
- **Handling outliers in target variable**: KNN averages can be affected by extreme values
- **Feature engineering**: Creating interaction features or polynomial features might improve performance
- **Ensemble methods**: Combining KNN with other regressors can improve robustness
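
To illustrate the first item in the list above, the sketch below contrasts a uniform average with an inverse-distance weighted average on invented neighbor values (`neighbor_distances` and `neighbor_targets` are hypothetical). It assumes all distances are non-zero; an exact match at distance zero would need special handling.

```python
import numpy as np

# Hypothetical neighbors of one query point: distances and target values
neighbor_distances = np.array([1.0, 2.0, 4.0])
neighbor_targets = np.array([1.0, 2.0, 4.0])

# Uniform weighting: plain average of the neighbor targets
uniform_pred = neighbor_targets.mean()

# Inverse-distance weighting: closer neighbors count more
weights = 1.0 / neighbor_distances
weighted_pred = np.sum(weights * neighbor_targets) / np.sum(weights)

print("Uniform average:          ", round(float(uniform_pred), 3))
print("Inverse-distance weighted:", round(float(weighted_pred), 3))
```

The weighted prediction is pulled toward the target of the closest neighbor, which can help when the nearest point is much more similar to the query than the rest.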

## Conclusion
- **Scaling** prevents large‑range features from dominating distance computations in regression, just as in classification.  
- **Tuning K** via validation balances bias and variance; a very small K overfits, a very large K underfits.  
- **Distance metric and K** are hyperparameters; small grids can reveal significant differences in RMSE.  
- **Regression metrics** (RMSE, MAE, R², MAPE) provide different perspectives on model performance.  
- KNN regression remains a powerful, intuitive baseline—use it to build understanding about distance-based prediction before advancing to more sophisticated models like Random Forest or Gradient Boosting.