<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/K-Nearest%20Neighbours%20Regression/KNN%20Regression%20with%20California%20Housing%20Dataset%20Case%20Study.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Case Study: KNN Regression with California Housing Dataset

K‑Nearest Neighbors (KNN) is a simple yet powerful machine learning algorithm that can be applied to both classification and regression tasks. In **regression**, instead of voting for a class label, KNN predicts a continuous value by **averaging** the target values of the K nearest neighbors. In this case study, we'll step through a practical example using the **California Housing** dataset to illustrate key concepts and best practices of KNN regression. This dataset contains information about housing blocks in California from the 1990 census, with 8 features (e.g., median income, house age, average rooms, location) and a target variable representing the median house value.
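To make the averaging rule concrete, here is a minimal from-scratch sketch. The helper `knn_predict` and the toy data are illustrative assumptions, not part of scikit-learn or of this notebook's pipeline. It assumes brute-force Euclidean distance and an unweighted mean.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Predict by averaging the targets of the k nearest training points."""
    # Euclidean distance from the query to every training point
    dists = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Keep the k smallest distances (ties fall back to target order, fine for a sketch)
    nearest = sorted(dists)[:k]
    return sum(y for _, y in nearest) / k

# Toy data: one feature, targets in arbitrary units (hypothetical values)
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [1.0, 2.0, 3.0, 10.0]

# The three nearest points to 2.2 are x=2, x=3 and x=1; x=10 is ignored
prediction = knn_predict(train_X, train_y, query=(2.2,), k=3)
print(prediction)
```

The far-away point at x=10 has no influence on the result, which is the sense in which KNN is a local method.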

## What we'll cover
- **Data exploration and preparation:** Understanding feature distributions and splitting data into training, validation, and test sets.  
- **Impact of feature scaling:** Demonstrating how scaling features affects KNN performance in regression tasks.  
- **Choosing the number of neighbors (K):** Tuning K to balance model complexity (bias vs. variance).  
- **Distance metric considerations:** How the choice of distance measure can affect KNN predictions.  
- **Model evaluation:** Evaluating the final model using regression metrics (RMSE, MAE, R², MAPE) on a test set to check that it generalizes well to unseen data.

## Learning Objectives

By the end of this case study, you will be able to:

1. **Understand the critical role of feature scaling** in distance-based algorithms and demonstrate its impact on model performance
2. **Apply systematic hyperparameter tuning** using train/validation/test splits to avoid overfitting
3. **Recognize and explain the bias-variance tradeoff** when selecting the number of neighbors (K)
4. **Evaluate regression models** using multiple metrics (RMSE, MAE, R², MAPE) and interpret their meaning
5. **Analyze prediction errors** using percentage error plots to identify model strengths and limitations
6. **Make informed decisions** about when KNN regression is appropriate for a given problem
7. **Understand computational implications** of KNN for real-world deployment scenarios

## Exploring the Dataset
Before diving into modeling, let's load the dataset and examine its features. The California Housing dataset has 20,640 samples, each with 8 features. The target is `MedHouseVal` (median house value in $100,000s).

**Features:**  
- MedInc: median income in block group  
- HouseAge: median house age in block group  
- AveRooms: average number of rooms per household  
- AveBedrms: average number of bedrooms per household  
- Population: block group population  
- AveOccup: average number of household members  
- Latitude: block group latitude  
- Longitude: block group longitude  

Large differences in magnitude (e.g., *Population* in thousands vs *AveBedrms* around 1) motivate **scaling** before using distance-based models.

In [None]:
# Imports and data loading
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    mean_absolute_percentage_error
)

# Load the California housing dataset
data = fetch_california_housing()
X = data.data
y = data.target
feature_names = data.feature_names

# Create DataFrame for exploration
df = pd.DataFrame(X, columns=feature_names)
df['MedHouseVal'] = y
df.head()

Let's examine the target variable distribution and basic feature statistics before proceeding. 

**Data Splitting Strategy:**  
We split the data into three sets with a 60/20/20 ratio:
- **Training set (60%)**: Used to fit the model (learn from the data)
- **Validation set (20%)**: Used to tune hyperparameters (select best K, distance metric, etc.)
- **Test set (20%)**: Held out completely until final evaluation

This three-way split matters because selecting hyperparameters by validation score fits those choices to that particular subset, which makes the validation score an optimistic estimate. The test set, untouched during tuning, provides an unbiased performance estimate on truly unseen data.

> **Question**: You've used the validation set to tune K from 1 to 30, ultimately selecting K=7 with validation RMSE of \$0.52. Why is it critical to evaluate on a separate test set before deploying the model?
>  
> A) The validation set was used for hyperparameter selection, which can lead to optimistic performance estimates.
>
> B) The test set provides additional opportunities to fine-tune hyperparameters for better accuracy.
>
> C) Validation RMSE is systematically biased upward and always overestimates real-world error.
>
> D) The test set helps identify which features should be added or removed from the model.

In [None]:
# Examine target distribution
print("Target variable (MedHouseVal) statistics:")
print(df['MedHouseVal'].describe(), "\n")

# Descriptive statistics of features
display(df[feature_names].describe().T[['mean', 'std', 'min', 'max']])

# Plot target distribution
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(y, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Median House Value ($100k)')
plt.ylabel('Frequency')
plt.title('Distribution of Target Variable')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.boxplot(y)
plt.ylabel('Median House Value ($100k)')
plt.title('Box Plot of Target Variable')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Split into train, validation, and test sets (60/20/20)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
print("Train size:", X_train.shape[0], "Validation size:", X_val.shape[0], "Test size:", X_test.shape[0])

## Effect of Feature Scaling on KNN Regression

Just like in classification, KNN regression uses distance to find nearest neighbors; if features are on very different scales, distance calculations will be dominated by the feature with the largest range. The example below illustrates how a difference in *Population* (thousands) can swamp a difference in *AveBedrms* (around 1). Therefore, scaling features to comparable ranges is critical for KNN.

In [None]:
# Demonstrate distance dominance (hypothetical differences)
from math import sqrt

delta_population_large = 1000.0
delta_bedrooms_small = 0.5

d1 = sqrt(delta_population_large**2 + 0.0**2)
d2 = sqrt(0.0**2 + delta_bedrooms_small**2)

print("Distance if only Population differs by +1000:", round(d1, 3))
print("Distance if only AveBedrms differs by +0.5  :", round(d2, 3))
print("Ratio (Population / Bedrooms):", round(d1 / d2, 1))

Next, we train a baseline KNN regression model with `K=5` using **unscaled** features and **scaled** features to compare validation performance. We'll use RMSE (Root Mean Squared Error) as our primary metric. Note that we scale using parameters learned from the training set only to avoid leakage.

In [None]:
# Baseline without scaling
knn_raw = KNeighborsRegressor(n_neighbors=5)
knn_raw.fit(X_train, y_train)
raw_val_pred = knn_raw.predict(X_val)
raw_val_rmse = np.sqrt(mean_squared_error(y_val, raw_val_pred))

# Baseline with scaling (fit on train only)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled   = scaler.transform(X_val)

knn_scaled = KNeighborsRegressor(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
scaled_val_pred = knn_scaled.predict(X_val_scaled)
scaled_val_rmse = np.sqrt(mean_squared_error(y_val, scaled_val_pred))

print(f"Validation RMSE without scaling: ${raw_val_rmse:.3f} (×100k)")
print(f"Validation RMSE with scaling:    ${scaled_val_rmse:.3f} (×100k)")
print(f"Improvement: {((raw_val_rmse - scaled_val_rmse) / raw_val_rmse * 100):.1f}%")

The scaled model typically performs significantly better because each feature contributes fairly to distance computation.

**Key Takeaway:** Feature scaling is essential for KNN regression. Without scaling, features with larger ranges (like Population) dominate distance calculations, while important features with smaller ranges (like AveBedrms) are effectively ignored. **Always scale features before using KNN.** From this point forward, all models will use scaled features.
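As a rough illustration of what standardization does and why the statistics come from the training set only, here is a small sketch on one hypothetical column. The helper names and the values are assumptions for illustration. It uses the population standard deviation, which is also what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Learn the mean and population std from training values only."""
    return fmean(column), pstdev(column)

def standardize(column, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(v - mean) / std for v in column]

# Hypothetical Population-like values
train_population = [500.0, 1000.0, 1500.0, 3000.0]
val_population = [1200.0, 8000.0]

mean, std = fit_standardizer(train_population)
train_scaled = standardize(train_population, mean, std)
# Validation data reuses the training statistics, so nothing leaks from it
val_scaled = standardize(val_population, mean, std)

print("Train scaled:", [round(v, 3) for v in train_scaled])
print("Val scaled:  ", [round(v, 3) for v in val_scaled])
```

After scaling, the training column has mean 0 and standard deviation 1, so a one-unit difference means "one standard deviation" for every feature treated this way.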

## Distance Metric Considerations

Beyond scaling, the choice of **distance metric** itself is a hyperparameter that affects KNN predictions. Common options include:
- **Euclidean (L2)**: Straight-line distance; squares differences before summing (default choice)
- **Manhattan (L1)**: Sum of absolute differences; can be more robust to outliers in some cases

The distance metric determines how we measure similarity between points. When features are unscaled, the metric choice matters less than the scaling issue—features with larger numerical ranges will dominate the distance calculation regardless of whether you use Euclidean or Manhattan distance.
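The two metrics can disagree about which point is nearest. The sketch below uses hypothetical 2-D points and hand-written helper functions (not scikit-learn calls) to show one such case.

```python
import math

def euclidean(a, b):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """L1 distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

query = (0.0, 0.0)
p = (3.0, 0.0)  # differs a lot in one feature
q = (2.0, 2.0)  # differs moderately in both features

# Euclidean: p is at 3.0 and q is at sqrt(8), about 2.83, so q is nearer
# Manhattan: p is at 3.0 and q is at 4.0, so p is nearer
nearest_l2 = min([p, q], key=lambda pt: euclidean(query, pt))
nearest_l1 = min([p, q], key=lambda pt: manhattan(query, pt))

print("Nearest under Euclidean:", nearest_l2)
print("Nearest under Manhattan:", nearest_l1)
```

In scikit-learn the metric is chosen with the `metric` argument, for example `KNeighborsRegressor(metric='manhattan')`, and it can be compared on the validation set in the same way as K.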

> **Question**: Your KNN model uses 3 features: 'MedInc' (range 0-15), 'Population' (range 0-35,000), and 'Latitude' (range 32-42). Without scaling, which feature will dominate the distance calculations, and why?
>
> A) MedInc—it has the strongest correlation with house prices
>
> B) Population—it has the largest numerical range
>
> C) Latitude—geographic features always have higher weight in distance metrics
>
> D) All features contribute equally because KNN normalizes distances automatically

## Choosing K: Bias–Variance Trade‑off

The number of neighbors (K) controls the model's complexity and represents a fundamental bias-variance tradeoff:

**Small K (e.g., K=1)**:
- Highly flexible, fits training data very closely
- **High variance**: Sensitive to noise in individual training points
- Risk of overfitting: Excellent training performance but poor generalization

**Large K (e.g., K=100+)**:
- Averages over many neighbors, produces smoother predictions
- **High bias**: May miss local patterns and underfit the data
- More stable but potentially too simple

We'll sweep K from 1 to 30 and compare training vs. validation RMSE to find the sweet spot. A large gap between training and validation error signals overfitting.
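The two extremes can be seen on a tiny hypothetical 1-D dataset. This sketch only reports training error, using hand-written helpers that are assumptions for illustration. With K=1 every training point is its own nearest neighbor, and with K equal to the dataset size every prediction collapses to the global mean.

```python
def knn_predict_1d(train_x, train_y, query, k):
    """Average the targets of the k training points closest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return sum(train_y[i] for i in order[:k]) / k

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

# Hypothetical noisy 1-D data with distinct x values
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.1, 1.3, 1.8, 3.4, 3.9, 5.2]

train_rmse_by_k = {}
for k in (1, 3, len(train_x)):
    preds = [knn_predict_1d(train_x, train_y, x, k) for x in train_x]
    train_rmse_by_k[k] = rmse(train_y, preds)
    print(f"K={k}: training RMSE = {train_rmse_by_k[k]:.3f}")
```

A training RMSE of zero at K=1 says nothing about generalization, which is why the sweep compares training error against validation error.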

> **Question**: After training KNN models with different K values, you observe that K=1 achieves training RMSE of \$0.05 but validation RMSE of \$0.75, while K=5 achieves training RMSE of \$0.45 and validation RMSE of \$0.52. What does this pattern suggest about the K=1 model?
>  
> A) The model has high variance and is overfitting to training noise.
>
> B) The model has high bias and requires more complex features.
>
> C) The training dataset is too small and more samples would fix the issue.
>
> D) Add more features to improve generalization.

In [None]:
train_rmse, val_rmse = [], []
k_sweep = range(1, 31)

for k in k_sweep:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train_scaled, y_train)
    train_pred = model.predict(X_train_scaled)
    val_pred = model.predict(X_val_scaled)
    train_rmse.append(np.sqrt(mean_squared_error(y_train, train_pred)))
    val_rmse.append(np.sqrt(mean_squared_error(y_val, val_pred)))

# Best K by validation RMSE
best_k_idx = int(np.argmin(val_rmse))
chosen_k = best_k_idx + 1
best_val = min(val_rmse)

# Use Euclidean distance (default and most commonly used)
chosen_metric = 'euclidean'

print("Selected hyperparameters:")
print(f"  K = {chosen_k}")
print(f"  Distance metric = {chosen_metric}")
print(f"  Validation RMSE = ${best_val:.3f} (×100k)")

# Plot train vs validation RMSE vs K
plt.figure(figsize=(10, 6))
plt.plot(list(k_sweep), train_rmse, marker='o', label='Train RMSE', linewidth=2)
plt.plot(list(k_sweep), val_rmse, marker='s', label='Validation RMSE', linewidth=2)
plt.axvline(chosen_k, linestyle='--', color='red', label=f'Best K={chosen_k}')
plt.xlabel('K (Number of Neighbors)', fontsize=12)
plt.ylabel('RMSE ($100k)', fontsize=12)
plt.title('Bias-Variance Trade-off: RMSE vs K', fontsize=14)
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Combine training and validation sets for final training
X_train_all = np.vstack([X_train, X_val])
y_train_all = np.hstack([y_train, y_val])

# Build pipeline (scaler + KNN) with chosen hyperparameters
final_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsRegressor(n_neighbors=chosen_k, metric=chosen_metric))
])

final_pipe.fit(X_train_all, y_train_all)

# Predict on test set
test_pred = final_pipe.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))
test_mae = mean_absolute_error(y_test, test_pred)
test_r2 = r2_score(y_test, test_pred)
test_mape = mean_absolute_percentage_error(y_test, test_pred)

print("="*50)
print("FINAL TEST SET PERFORMANCE")
print("="*50)
print(f"RMSE:  ${test_rmse:.3f} (×100k) = ${test_rmse*100:.0f}k")
print(f"MAE:   ${test_mae:.3f} (×100k) = ${test_mae*100:.0f}k")
print(f"R²:    {test_r2:.3f}")
print(f"MAPE:  {test_mape*100:.1f}%")
print("="*50)

## Model Evaluation on Test Set

After selecting optimal hyperparameters using the validation set, we perform a final evaluation on the **held-out test set**. This is critical for obtaining an unbiased estimate of real-world performance.

**Final Training Process:**
1. Combine training + validation sets (now that hyperparameter tuning is complete)
2. Refit the model on this combined dataset
3. Evaluate once on the test set
4. Compare test performance to validation performance

**Expected Behavior:**  
Test performance typically matches validation performance closely. A small difference (e.g., validation RMSE \$0.52 vs. test RMSE \$0.56) is normal due to random variation in data splits. A large gap would suggest overfitting to the validation set during hyperparameter tuning.

**Evaluation Metrics:**
- **RMSE (Root Mean Squared Error)**: Penalizes large errors more heavily; same units as target ($100k)
- **MAE (Mean Absolute Error)**: Average absolute prediction error; more interpretable
- **R² Score**: Proportion of variance explained (1.0 = perfect, 0.0 = no better than always predicting the mean; negative values mean worse than that baseline)
- **MAPE (Mean Absolute Percentage Error)**: Percentage error; useful for relative comparison
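For reference, the four metrics can be computed by hand as in the sketch below. The function name and the toy values are assumptions for illustration. MAPE is returned as a fraction, matching `mean_absolute_percentage_error`, and it assumes no actual value is zero.

```python
import math

def regression_metrics(actual, predicted):
    """Compute RMSE, MAE, R² and MAPE for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e ** 2 for e in errors)                    # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # spread around the mean
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
        "mape": sum(abs(e / a) for e, a in zip(errors, actual)) / n,
    }

# Hypothetical targets and predictions in $100k units
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Both errors here are 0.5 in absolute terms, yet the one on the cheaper block contributes three times as much to MAPE, which is the scale effect that percentage metrics are meant to expose.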

> **Question**: Your final KNN model achieves validation RMSE of \$0.52 and test RMSE of \$0.56. Before deployment, what's the most appropriate interpretation?
>  
> A) The difference is normal variation; verify test performance meets business requirements.
>
> B) The test set has data leakage and should be regenerated with better random seeds.
>
> C) Re-tune hyperparameters using the test set to minimize the performance gap.
>
> D) The model is underfitting and needs a smaller K value for better flexibility.

## Where Do Errors Come From?

Understanding where KNN regression makes larger errors helps us interpret model performance and identify areas for improvement. As shown in the lecture slides, prediction errors in KNN regression come from two types of regions:

### High Certainty Regions
- **Characteristics**: Neighbors have similar target values with low variation
- **Prediction Quality**: Low residual errors (high confidence predictions)
- **Why**: When K nearest neighbors have similar values, their average is a reliable estimate
- **Example**: In dense neighborhoods where houses have similar prices

### High Ambiguity Regions  
- **Characteristics**: Neighbors have high variation in target values
- **Prediction Quality**: High residual errors (uncertain predictions)
- **Why**: When K nearest neighbors have very different values, averaging produces less reliable estimates
- **Example**: Boundary regions between expensive and affordable neighborhoods, or sparse data regions

### Additional Error Sources
- **Sparse Regions**: Areas with few training points lead to unreliable neighbor selection
- **Boundary Regions**: Transition zones where the K nearest neighbors straddle two different underlying patterns, so their average represents neither pattern well

> **Key Insight**: KNN performs best in regions where neighbors have consistent target values. The model struggles in regions with high local variability or sparse data, as the averaging assumption breaks down.

In [None]:
# Demonstrate high certainty vs high ambiguity regions
from sklearn.metrics import pairwise_distances

# Get predictions for all training points
train_pred_all = final_pipe.predict(X_train_all)
train_errors = np.abs(y_train_all - train_pred_all)

# For each training point, calculate the standard deviation of its K nearest neighbors' target values
X_train_all_scaled = final_pipe.named_steps['scaler'].transform(X_train_all)
neighbor_std = []

for i in range(len(X_train_all_scaled)):
    # Calculate distances from this point to all other training points
    distances = pairwise_distances(X_train_all_scaled[i:i+1], X_train_all_scaled, metric=chosen_metric).ravel()
    # Take the K nearest neighbors, skipping position 0 (the point itself, at distance 0)
    k_nearest_idx = np.argsort(distances)[1:chosen_k+1]
    # Calculate std of neighbors' target values
    neighbor_targets = y_train_all[k_nearest_idx]
    neighbor_std.append(np.std(neighbor_targets))

neighbor_std = np.array(neighbor_std)

# Identify high certainty and high ambiguity regions
low_std_threshold = np.percentile(neighbor_std, 25)
high_std_threshold = np.percentile(neighbor_std, 75)

high_certainty_mask = neighbor_std < low_std_threshold
high_ambiguity_mask = neighbor_std > high_std_threshold

print("="*60)
print("ERROR ANALYSIS BY REGION TYPE")
print("="*60)
print(f"\nHigh Certainty Regions (neighbor std < {low_std_threshold:.3f}):")
print(f"  - Number of points: {high_certainty_mask.sum()}")
print(f"  - Avg neighbor std: ${np.mean(neighbor_std[high_certainty_mask]):.3f} (×100k)")
print(f"  - Avg prediction error: ${np.mean(train_errors[high_certainty_mask]):.3f} (×100k)")

print(f"\nHigh Ambiguity Regions (neighbor std > {high_std_threshold:.3f}):")
print(f"  - Number of points: {high_ambiguity_mask.sum()}")
print(f"  - Avg neighbor std: ${np.mean(neighbor_std[high_ambiguity_mask]):.3f} (×100k)")
print(f"  - Avg prediction error: ${np.mean(train_errors[high_ambiguity_mask]):.3f} (×100k)")

print(f"\n{'='*60}")
print(f"Error is {np.mean(train_errors[high_ambiguity_mask]) / np.mean(train_errors[high_certainty_mask]):.1f}x higher in high ambiguity regions!")
print("="*60)

# Visualize the relationship
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(neighbor_std, train_errors, alpha=0.3, s=10)
plt.xlabel('Neighbor Target Std Dev ($100k)')
plt.ylabel('Prediction Error ($100k)')
plt.title('Error vs Neighbor Variation')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.hist([neighbor_std[high_certainty_mask], neighbor_std[high_ambiguity_mask]], 
         bins=30, label=['High Certainty', 'High Ambiguity'], alpha=0.7)
plt.xlabel('Neighbor Target Std Dev ($100k)')
plt.ylabel('Frequency')
plt.title('Distribution of Region Types')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Plot actual vs predicted for test set
plt.figure(figsize=(12, 5))

# Scatter plot
plt.subplot(1, 2, 1)
plt.scatter(y_test, test_pred, alpha=0.5, edgecolor='k', s=20)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2, label='Perfect prediction')
plt.xlabel('Actual House Value ($100k)')
plt.ylabel('Predicted House Value ($100k)')
plt.title(f'Test Set: Actual vs Predicted\n(R² = {test_r2:.3f}, RMSE = ${test_rmse:.3f})')
plt.legend()
plt.grid(True, alpha=0.3)

# Residual plot - using percentage error
plt.subplot(1, 2, 2)
percentage_error = (y_test - test_pred) / y_test * 100
plt.scatter(test_pred, percentage_error, alpha=0.5, edgecolor='k', s=20)
plt.axhline(0, color='r', linestyle='--', lw=2)
plt.xlabel('Predicted House Value ($100k)')
plt.ylabel('Percentage Error (%)')
plt.title('Residual Plot (Percentage Error)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Distribution of residuals - using percentage error
plt.figure(figsize=(10, 4))
plt.hist(percentage_error, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Percentage Error (%)')
plt.ylabel('Frequency')
plt.title('Distribution of Prediction Errors (Percentage)')
plt.axvline(0, color='r', linestyle='--', lw=2, label='Zero error')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**Understanding the Visualizations:**

1. **Actual vs Predicted Plot (left)**: Points closer to the diagonal line indicate better predictions. Deviations show where the model struggles.

2. **Percentage Error Plot (right)**: Shows prediction errors as a percentage of actual values. Using percentage error instead of absolute error provides better insights:
   - **Removes scale dependency**: A \$50k error on a \$500k house (10%) is very different from a \$50k error on a \$100k house (50%)
   - **Prevents fan-out pattern**: Absolute residuals often increase with predicted values; percentage errors should be more evenly distributed
   - **Easier interpretation**: We can quickly identify if errors are acceptable (e.g., within ±20%)

**What to look for:** Ideally, percentage errors should be randomly scattered around zero with no clear patterns. Systematic patterns (e.g., consistent over/under-prediction for certain price ranges) suggest model limitations.

## Feature Importance Analysis

Unlike tree-based models, KNN doesn't have built-in feature importance scores. However, we can use **permutation importance** to understand which features contribute most to predictions. This technique randomly shuffles each feature one at a time and measures how much the model's performance drops—larger drops indicate more important features.
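The shuffle-and-measure idea can be sketched without any library support. The function below is a simplified stand-in, not scikit-learn's implementation, and `toy_predict` is a hypothetical model that deliberately ignores its second feature.

```python
import random

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Return the mean increase in RMSE after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild each row with only this column replaced by a shuffled value
            X_perm = [row[:col] + (v,) + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(rmse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

def toy_predict(row):
    """Hypothetical model: depends on feature 0 only."""
    return 2.0 * row[0]

X = [(float(i), float(i % 3)) for i in range(20)]
y = [2.0 * row[0] for row in X]

importances = permutation_importance_sketch(toy_predict, X, y)
print(importances)
```

Shuffling the ignored feature leaves the error unchanged, so its importance is zero, while shuffling the feature the model relies on raises the error.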

In [None]:
from sklearn.inspection import permutation_importance

# Compute permutation importance on test set
perm_importance = permutation_importance(
    final_pipe, X_test, y_test,
    n_repeats=10, random_state=42,
    scoring='neg_root_mean_squared_error'
)

# Create DataFrame for visualization
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': perm_importance.importances_mean,
    'std': perm_importance.importances_std
}).sort_values('importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance'], xerr=importance_df['std'])
plt.xlabel('Increase in RMSE when permuted ($100k)')
plt.ylabel('Feature')
plt.title('Permutation Feature Importance')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("\nFeature Importance Ranking:")
display(importance_df)

## Common Pitfalls and Best Practices

Watch out for these common mistakes when using KNN regression:

### Critical Mistakes to Avoid:
1. **Forgetting to scale features** ❌  
   - This is the #1 mistake with KNN. Features with larger ranges will completely dominate distance calculations
   - **Always** use StandardScaler or MinMaxScaler before applying KNN

2. **Using test set for hyperparameter tuning** ❌  
   - Never tune K or distance metrics using the test set
   - Use a separate validation set or cross-validation for hyperparameter selection

3. **Choosing K=1 for production** ❌  
   - K=1 is extremely sensitive to noise and outliers
   - While it may show perfect training performance, it rarely generalizes well
   - Start with K=5 as a reasonable default and tune from there

4. **Ignoring computational cost** ❌  
   - KNN stores all training data and computes distances at prediction time
   - For large datasets (millions of rows), KNN can be prohibitively slow
   - Consider approximate nearest neighbor methods for large-scale applications

### Best Practices:
- ✅ **Always visualize** your predictions vs actuals to spot patterns in errors
- ✅ **Use cross-validation** for more robust hyperparameter tuning (see Limitations section)
- ✅ **Consider distance-weighted KNN** (`weights='distance'`) to give closer neighbors more influence
- ✅ **Remove outliers** or use robust scaling if your target variable has extreme values
- ✅ **Try feature selection** to reduce dimensionality and improve performance (curse of dimensionality)
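As a sketch of what distance weighting changes, the function below averages neighbor targets with inverse-distance weights, which is how scikit-learn documents `weights='distance'`. The helper itself and the toy data are assumptions for illustration, not library code.

```python
import math

def weighted_knn_predict(train_X, train_y, query, k):
    """Inverse-distance-weighted average of the k nearest targets."""
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    # An exact match has distance 0: return the mean target of the exact matches
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Hypothetical 1-D training data
train_X = [(0.0,), (1.0,), (4.0,)]
train_y = [10.0, 20.0, 40.0]

uniform = sum(train_y) / len(train_y)  # plain average of all three targets
weighted = weighted_knn_predict(train_X, train_y, query=(2.0,), k=3)

print(f"Uniform average:  {uniform:.2f}")
print(f"Distance-weighted: {weighted:.2f}")
```

The neighbor at distance 1 gets twice the weight of the two at distance 2, so the weighted prediction moves toward its target of 20.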

## Computational Complexity

Understanding KNN's computational characteristics is crucial for real-world deployment:

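### Training Complexity: O(1) with Brute-Force Search
- **KNN is a "lazy learner"**: It doesn't fit any parameters during training
- With brute-force search, training simply stores the feature vectors and target values in memory
- This makes training nearly instantaneous; if scikit-learn builds a KD-tree or ball tree instead, `fit` does extra up-front work to speed up later queries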

### Prediction Complexity: O(n × d)
- **For each prediction**, KNN must:
  - Calculate distance to all n training points
  - Each distance calculation involves d features
  - Sort or partially sort distances to find K nearest neighbors
- This becomes expensive for large datasets or real-time applications
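The steps above can be sketched as a brute-force search. This is an illustrative stand-in with a hypothetical helper name, not how scikit-learn implements its search. It uses `heapq.nsmallest` so that only K candidates are kept instead of fully sorting all n distances.

```python
import heapq
import math

def k_nearest_brute_force(train_X, query, k):
    """Return (distance, index) pairs for the k closest training points."""
    # n distance computations, each touching d features: O(n * d)
    dists = ((math.dist(x, query), i) for i, x in enumerate(train_X))
    # Partial selection keeps only k candidates instead of sorting all n
    return heapq.nsmallest(k, dists)

# Hypothetical 2-D training points
train_X = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0), (9.0, 9.0)]

neighbors = k_nearest_brute_force(train_X, query=(0.0, 0.1), k=2)
neighbor_indices = [i for _, i in neighbors]
print(neighbor_indices)
```

Every query still visits all n training points, which is why per-prediction cost grows with the size of the stored training set.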

### Example: Prediction Time
Let's measure the prediction time for our model:

In [None]:
import time

# Measure training time
start = time.time()
quick_knn = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsRegressor(n_neighbors=chosen_k, metric=chosen_metric))
])
quick_knn.fit(X_train_all, y_train_all)
train_time = time.time() - start

# Measure prediction time for single sample
start = time.time()
_ = quick_knn.predict(X_test[:1])
single_pred_time = time.time() - start

# Measure prediction time for all test samples
start = time.time()
_ = quick_knn.predict(X_test)
batch_pred_time = time.time() - start

print("="*60)
print("COMPUTATIONAL PERFORMANCE")
print("="*60)
print(f"\nTraining Set Size: {len(X_train_all)} samples, {X_train_all.shape[1]} features")
print(f"Training Time: {train_time*1000:.2f} ms (essentially instant)")
print(f"\nSingle Prediction Time: {single_pred_time*1000:.2f} ms")
print(f"Batch Prediction ({len(X_test)} samples): {batch_pred_time*1000:.2f} ms")
print(f"Average Prediction Time: {batch_pred_time/len(X_test)*1000:.4f} ms per sample")
print(f"\nThroughput: {len(X_test)/batch_pred_time:.0f} predictions/second")
print("="*60)

print("\n💡 Key Takeaway:")
print("   Training is instant, but prediction scales linearly with training set size.")
print("   For 1M+ training samples, consider approximate nearest neighbor methods.")

## Limitations and Advanced Topics

### Current Scope Limitations

**1. Single Hold-Out Validation**  
This notebook uses a **single train/validation/test split**, which is simple but has drawbacks:
- Performance estimates depend on the specific random split
- Small datasets may have high variance in estimates
- We might get "lucky" or "unlucky" with our particular validation set

**Better Approach: K-Fold Cross-Validation**  
Instead of one validation set, k-fold CV:
- Splits training data into k folds (typically k=5 or k=10)
- Trains k models, each using k-1 folds for training and 1 fold for validation
- Averages performance across all k folds for more robust estimates
- Reduces dependence on any single data split

Example code structure:
```python
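from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scaling lives inside the pipeline, so each fold fits its own scaler
pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsRegressor(n_neighbors=7))])
# scikit-learn maximizes scores, so RMSE is returned negated
scores = cross_val_score(pipe, X_train, y_train, cv=5,
                         scoring='neg_root_mean_squared_error')
print(f"CV RMSE: {-scores.mean():.3f} (+/- {scores.std():.3f})")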
```

**2. Brute-Force Neighbor Search**  
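We left sklearn's neighbor search at its default, `algorithm='auto'`, which chooses between brute force, a KD-tree, and a ball tree based on the data passed to `fit`. Brute-force search works well for small/medium datasets but doesn't scale. For large datasets, consider: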
- **KD-Trees**: Efficient for low-dimensional data (d < 20)
- **Ball Trees**: Better for higher dimensions than KD-trees
- **Approximate methods**: FAISS, Annoy, HNSW for millions of points

**3. Additional Considerations**
- **Distance-weighted KNN**: Use `weights='distance'` to give closer neighbors more influence
- **Outlier handling**: KNN averages can be affected by extreme target values
- **Feature engineering**: Interaction or polynomial features might improve performance
- **Ensemble methods**: Combine KNN with other models for better robustness
- **Curse of dimensionality**: KNN performance degrades in very high dimensions (d > 50)

## Conclusion and Key Takeaways

### What We Learned

1. **Feature scaling is non-negotiable** for KNN. Without it, features with larger ranges dominate distance calculations and render other features useless.

2. **Hyperparameter tuning requires careful validation.** Using train/validation/test splits (or cross-validation) prevents overfitting to a specific data subset.

3. **K controls the bias-variance tradeoff.** Small K leads to high variance (overfitting), large K leads to high bias (underfitting). Tune K systematically using validation data.

4. **Multiple evaluation metrics** (RMSE, MAE, R², MAPE, percentage errors) provide different perspectives on model performance and help identify specific weaknesses.

5. **Error patterns reveal model limitations.** High ambiguity regions (neighbors with diverse targets) produce larger errors—understanding where KNN struggles is as important as measuring overall performance.

### When to Use KNN Regression

**✅ KNN is a Good Choice When:**

- **Small to medium datasets** (< 100K samples) where prediction latency isn't critical
- **Non-linear relationships** exist between features and target (KNN makes no linearity assumptions)
- **You need an interpretable baseline** before trying complex models
- **Feature interactions are complex** and difficult to engineer explicitly
- **You want quick prototyping** without extensive hyperparameter tuning
- **Local patterns matter** more than global trends

**❌ Avoid KNN When:**

- **Large datasets** (millions of rows) where prediction time becomes prohibitive
- **Real-time predictions** are required with strict latency requirements (< 10ms)
- **High-dimensional data** (d > 50) due to the curse of dimensionality
- **Features have no natural distance metric** (e.g., categorical data, text)
- **Interpretability of individual predictions** is critical (KNN doesn't explain why it made a specific prediction beyond "these are the nearest neighbors")
- **Memory constraints exist** (KNN stores entire training set)

### Comparison with Other Regression Models

| Model | Training Speed | Prediction Speed | Interpretability | Handles Non-linearity | Scales to Large Data |
|-------|---------------|------------------|------------------|----------------------|----------------------|
| **KNN** | Instant | Slow | Medium | Yes | No |
| **Linear Regression** | Fast | Fast | High | No | Yes |
| **Random Forest** | Slow | Fast | Medium | Yes | Yes |
| **Gradient Boosting** | Very Slow | Fast | Low | Yes | Yes |
| **Neural Networks** | Very Slow | Fast | Very Low | Yes | Yes |

### Next Steps

KNN regression is an excellent **baseline model** that provides intuition about distance-based prediction. After establishing this baseline:

1. Try **tree-based ensembles** (Random Forest, XGBoost) for better performance on most tabular data
2. Experiment with **feature engineering** to capture domain knowledge
3. Use **cross-validation** for more robust hyperparameter selection
4. Consider **model stacking** that combines KNN with other regressors
5. For production systems with large data, explore **approximate nearest neighbor** methods

**Remember:** Simple models like KNN often provide surprising performance and valuable insights—don't immediately jump to complex deep learning without establishing a solid baseline first.