# Machine Learning Tutorial: California Housing Regression

This notebook demonstrates a complete machine learning workflow for regression using the California Housing dataset.

## Objectives
- Load and explore the California Housing dataset
- Visualize the data to understand patterns and relationships
- Build and compare multiple regression models
- Evaluate model performance using multiple metrics
- Make predictions on new data

## 1. Import Libraries

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from lightgbm import LGBMRegressor

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported successfully!")

## 2. Load and Explore the Dataset

The California Housing dataset is derived from the 1990 U.S. census, with one row per census block group. It has 20,640 samples and 8 features:

**Features:**
- MedInc: Median income in block group
- HouseAge: Median house age in block group
- AveRooms: Average number of rooms per household
- AveBedrms: Average number of bedrooms per household
- Population: Block group population
- AveOccup: Average number of household members
- Latitude: Block group latitude
- Longitude: Block group longitude

**Target:**
- MedHouseVal: Median house value (in $100,000s)

In [None]:
# Load the dataset
housing = fetch_california_housing()

# Create a DataFrame for easier manipulation
df = pd.DataFrame(data=housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()

In [None]:
# Basic statistics
df.describe()

In [None]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())

# Check data types
print("\nData types:")
print(df.dtypes)

## 3. Data Visualization

In [None]:
# Distribution of the target variable
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(df['MedHouseVal'], bins=50, color='#4ECDC4', edgecolor='black')
axes[0].set_xlabel('Median House Value ($100,000s)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of House Values')
axes[0].axvline(df['MedHouseVal'].mean(), color='red', linestyle='--', label=f'Mean: {df["MedHouseVal"].mean():.2f}')
axes[0].axvline(df['MedHouseVal'].median(), color='orange', linestyle='--', label=f'Median: {df["MedHouseVal"].median():.2f}')
axes[0].legend()

# Box plot
axes[1].boxplot(df['MedHouseVal'], vert=True)
axes[1].set_ylabel('Median House Value ($100,000s)')
axes[1].set_title('Box Plot of House Values')

plt.tight_layout()
plt.show()

# Values are in units of $100,000, so they are printed without a dollar sign
print("Target variable statistics (in $100,000s):")
print(f"Mean: {df['MedHouseVal'].mean():.2f}")
print(f"Median: {df['MedHouseVal'].median():.2f}")
print(f"Std Dev: {df['MedHouseVal'].std():.2f}")
print(f"Min: {df['MedHouseVal'].min():.2f}")
print(f"Max: {df['MedHouseVal'].max():.2f}")

In [None]:
# Correlation heatmap
plt.figure(figsize=(12, 10))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, fmt='.2f', cbar_kws={'label': 'Correlation'})
plt.title('Feature Correlation Heatmap', fontsize=14)
plt.tight_layout()
plt.show()

# Print correlations with target
print("\nCorrelations with target (MedHouseVal):")
target_corr = correlation_matrix['MedHouseVal'].sort_values(ascending=False)
print(target_corr)

In [None]:
# Scatter plots of top correlated features vs target
top_features = correlation_matrix['MedHouseVal'].abs().sort_values(ascending=False)[1:5].index

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Top Features vs Median House Value', fontsize=16)

for idx, feature in enumerate(top_features):
    ax = axes[idx // 2, idx % 2]
    ax.scatter(df[feature], df['MedHouseVal'], alpha=0.3, s=10, color='#4ECDC4')
    ax.set_xlabel(feature)
    ax.set_ylabel('Median House Value ($100,000s)')
    ax.set_title(f'{feature} vs House Value (r={correlation_matrix.loc[feature, "MedHouseVal"]:.3f})')
    
    # Add trend line
    z = np.polyfit(df[feature], df['MedHouseVal'], 1)
    p = np.poly1d(z)
    ax.plot(df[feature], p(df[feature]), "r--", alpha=0.8, linewidth=2)

plt.tight_layout()
plt.show()

In [None]:
# Geographic visualization
plt.figure(figsize=(12, 8))
scatter = plt.scatter(df['Longitude'], df['Latitude'], 
                     c=df['MedHouseVal'], cmap='viridis', 
                     alpha=0.4, s=10)
plt.colorbar(scatter, label='Median House Value ($100,000s)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('California Housing Prices by Location')
plt.tight_layout()
plt.show()

In [None]:
# Distribution of all features
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
fig.suptitle('Distribution of All Features', fontsize=16)

for idx, column in enumerate(df.columns):
    ax = axes[idx // 3, idx % 3]
    ax.hist(df[column], bins=50, color='#45B7D1', edgecolor='black', alpha=0.7)
    ax.set_xlabel(column)
    ax.set_ylabel('Frequency')
    ax.set_title(column)

plt.tight_layout()
plt.show()

## 4. Prepare Data for Modeling

In [None]:
# Split features and target
X = housing.data
y = housing.target

# Split into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]:,} samples")
print(f"Test set size: {X_test.shape[0]:,} samples")
print(f"Number of features: {X_train.shape[1]}")

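# Standardize features. Ridge's penalty depends on feature scale; plain linear
# regression predictions are unaffected, but scaled coefficients are easier to compare.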
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nData prepared and standardized successfully!")
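
The scaler above is fit on the training split only, and the test split is transformed with those same statistics. A minimal plain-Python sketch of that idea follows. The helper names `fit_standardizer` and `standardize` and the toy values are illustrative assumptions, not part of scikit-learn. Like `StandardScaler`, the sketch uses the population standard deviation.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Learn the mean and population std from the training values only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    # Guard against constant columns to avoid dividing by zero
    return mean, (std if std > 0 else 1.0)

def standardize(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]

# Toy single-feature example (hypothetical median incomes)
train_income = [2.0, 4.0, 6.0, 8.0]
test_income = [5.0, 10.0]

mean, std = fit_standardizer(train_income)
train_scaled = standardize(train_income, mean, std)
test_scaled = standardize(test_income, mean, std)

print(f"train mean={mean:.2f}, std={std:.3f}")
print("test scaled:", [round(v, 3) for v in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction. The test values generally do not, which is expected because the test split never influences the statistics.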

## 5. Train Multiple Models

We'll train and compare five different regression algorithms:
1. Linear Regression (baseline)
2. Ridge Regression (regularized linear model)
3. Decision Tree
4. Random Forest
5. LightGBM

In [None]:
# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0, random_state=42),
    'Decision Tree': DecisionTreeRegressor(random_state=42, max_depth=10),
    'Random Forest': RandomForestRegressor(random_state=42, n_estimators=100, max_depth=15, n_jobs=-1),
    'LightGBM': LGBMRegressor(random_state=42, n_estimators=100, verbose=-1)
}

# Train models and store results
results = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Use scaled data for linear models, original for tree-based models
    if name in ['Linear Regression', 'Ridge Regression']:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        # Cross-validation score (negative MSE, we'll convert to positive)
        cv_scores = -cross_val_score(model, X_train_scaled, y_train, 
                                     cv=5, scoring='neg_mean_squared_error')
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # Cross-validation score
        cv_scores = -cross_val_score(model, X_train, y_train, 
                                     cv=5, scoring='neg_mean_squared_error')
    
    # Calculate metrics
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    
    # Store results
    results[name] = {
        'model': model,
        'predictions': y_pred,
        'r2': r2,
        'mae': mae,
        'mse': mse,
        'rmse': rmse,
        # Convert each fold's MSE to RMSE first, then aggregate across folds
        'cv_rmse_mean': np.sqrt(cv_scores).mean(),
        'cv_rmse_std': np.sqrt(cv_scores).std()
    }
    
    print(f"R² Score: {r2:.4f}")
    print(f"MAE: {mae:.4f} (${mae*100:.2f}k)")
    print(f"RMSE: {rmse:.4f} (${rmse*100:.2f}k)")
    print(f"Cross-validation RMSE: {np.sqrt(cv_scores).mean():.4f} (+/- {np.sqrt(cv_scores).std():.4f})")
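
`cross_val_score` returns one score per fold, so with `scoring='neg_mean_squared_error'` and `cv=5` there are five fold-level MSE values to aggregate. One reasonable convention, sketched below with made-up fold values, is to convert each fold to RMSE first and then take the mean and standard deviation, which keeps both numbers in target units. The square root of the standard deviation of the MSEs is a different quantity and does not measure the spread of RMSE.

```python
from math import sqrt
from statistics import fmean, pstdev

# Hypothetical per-fold MSE values (already sign-flipped to be positive)
fold_mse = [0.25, 0.36, 0.49, 0.36, 0.25]

# Convert each fold to RMSE, then aggregate
fold_rmse = [sqrt(m) for m in fold_mse]
rmse_mean = fmean(fold_rmse)
rmse_std = pstdev(fold_rmse)  # population std, matching NumPy's default ddof=0

# For contrast: sqrt of the std of the MSEs, which mixes units
sqrt_of_mse_std = sqrt(pstdev(fold_mse))

print(f"per-fold RMSE: {[round(r, 2) for r in fold_rmse]}")
print(f"CV RMSE: {rmse_mean:.3f} (+/- {rmse_std:.3f})")
print(f"sqrt(std of MSE): {sqrt_of_mse_std:.3f}")
```

With these values the two spread figures differ by a wide margin, so the choice of aggregation can change how stable a model appears.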

## 6. Model Evaluation and Comparison

In [None]:
# Compare model performance
comparison_df = pd.DataFrame({
    'Model': list(results.keys()),
    'R² Score': [results[m]['r2'] for m in results.keys()],
    'MAE ($100k)': [results[m]['mae'] for m in results.keys()],
    'RMSE ($100k)': [results[m]['rmse'] for m in results.keys()],
    'CV RMSE Mean': [results[m]['cv_rmse_mean'] for m in results.keys()]
})

comparison_df = comparison_df.sort_values('R² Score', ascending=False).reset_index(drop=True)
print("Model Comparison:")
print(comparison_df.to_string(index=False))

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# R² Score
axes[0].barh(comparison_df['Model'], comparison_df['R² Score'], color='#4ECDC4')
axes[0].set_xlabel('R² Score')
axes[0].set_title('Model R² Scores (Higher is Better)')
axes[0].set_xlim([0, 1])
for idx, (model, score) in enumerate(zip(comparison_df['Model'], comparison_df['R² Score'])):
    axes[0].text(score + 0.01, idx, f'{score:.4f}', va='center')

# MAE
axes[1].barh(comparison_df['Model'], comparison_df['MAE ($100k)'], color='#45B7D1')
axes[1].set_xlabel('MAE ($100,000s)')
axes[1].set_title('Mean Absolute Error (Lower is Better)')
for idx, (model, mae) in enumerate(zip(comparison_df['Model'], comparison_df['MAE ($100k)'])):
    axes[1].text(mae + 0.01, idx, f'{mae:.4f}', va='center')

# RMSE
axes[2].barh(comparison_df['Model'], comparison_df['RMSE ($100k)'], color='#FF6B6B')
axes[2].set_xlabel('RMSE ($100,000s)')
axes[2].set_title('Root Mean Squared Error (Lower is Better)')
for idx, (model, rmse) in enumerate(zip(comparison_df['Model'], comparison_df['RMSE ($100k)'])):
    axes[2].text(rmse + 0.01, idx, f'{rmse:.4f}', va='center')

plt.tight_layout()
plt.show()

In [None]:
# Predicted vs Actual for all models
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Predicted vs Actual House Values', fontsize=16)

for idx, (name, result) in enumerate(results.items()):
    ax = axes[idx // 3, idx % 3]
    
    # Scatter plot
    ax.scatter(y_test, result['predictions'], alpha=0.3, s=10, color='#4ECDC4')
    
    # Perfect prediction line
    min_val = min(y_test.min(), result['predictions'].min())
    max_val = max(y_test.max(), result['predictions'].max())
    ax.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect Prediction')
    
    ax.set_xlabel('Actual Values ($100,000s)')
    ax.set_ylabel('Predicted Values ($100,000s)')
    ax.set_title(f"{name}\nR²={result['r2']:.4f}, RMSE={result['rmse']:.4f}")
    ax.legend()

# Remove the extra subplot
fig.delaxes(axes[1, 2])

plt.tight_layout()
plt.show()

In [None]:
# Residual analysis for the best model
best_model_name = comparison_df.iloc[0]['Model']
best_predictions = results[best_model_name]['predictions']
residuals = y_test - best_predictions

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle(f'Residual Analysis - {best_model_name}', fontsize=14)

# Residual plot
axes[0].scatter(best_predictions, residuals, alpha=0.3, s=10, color='#4ECDC4')
axes[0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0].set_xlabel('Predicted Values ($100,000s)')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residual Plot')

# Residual distribution
axes[1].hist(residuals, bins=50, color='#45B7D1', edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Residuals')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Residuals')
axes[1].axvline(x=0, color='r', linestyle='--', linewidth=2)

plt.tight_layout()
plt.show()

print("Residual statistics:")
print(f"Mean: {residuals.mean():.4f}")
print(f"Std Dev: {residuals.std():.4f}")
print(f"Min: {residuals.min():.4f}")
print(f"Max: {residuals.max():.4f}")

In [None]:
# Feature importance for tree-based models
tree_based_models = ['Decision Tree', 'Random Forest', 'LightGBM']

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Feature Importance (Tree-Based Models)', fontsize=16)

for idx, model_name in enumerate(tree_based_models):
    if model_name in results:
        model = results[model_name]['model']
        importances = model.feature_importances_
        
        # Create DataFrame for easier plotting
        importance_df = pd.DataFrame({
            'Feature': housing.feature_names,
            'Importance': importances
        }).sort_values('Importance', ascending=True)
        
        axes[idx].barh(importance_df['Feature'], importance_df['Importance'], color='#4ECDC4')
        axes[idx].set_xlabel('Importance')
        axes[idx].set_title(model_name)

plt.tight_layout()
plt.show()

## 7. Make Predictions on New Data

Let's use the best model to predict the median house value for a few hypothetical block groups. Each row of the dataset describes a census block group, not a single house.

In [None]:
# Create some example new data
# Format: [MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude]
new_samples = np.array([
    [8.3, 41.0, 6.98, 1.02, 322.0, 2.55, 37.88, -122.23],  # High-value area (SF Bay Area)
    [3.2, 20.0, 5.00, 1.00, 1500.0, 3.00, 34.05, -118.24],  # Mid-value area (LA)
    [2.5, 15.0, 4.50, 1.20, 800.0, 2.80, 36.75, -119.77],   # Lower-value area (Central Valley)
])

# Short description of each sample
feature_info = [
    "High income, older house, spacious, SF Bay Area",
    "Medium income, newer house, average size, LA",
    "Lower income, new house, smaller, Central Valley"
]

# Get the best model
best_model = results[best_model_name]['model']

# Make predictions (scale if using linear models)
if best_model_name in ['Linear Regression', 'Ridge Regression']:
    new_samples_scaled = scaler.transform(new_samples)
    predictions = best_model.predict(new_samples_scaled)
else:
    predictions = best_model.predict(new_samples)

# Display predictions
print(f"Predictions using {best_model_name}:\n")
print("="*80)

for i, (sample, pred, info) in enumerate(zip(new_samples, predictions, feature_info)):
    print(f"\nSample {i+1}: {info}")
    print(f"  Features:")
    for feat_name, feat_val in zip(housing.feature_names, sample):
        print(f"    {feat_name}: {feat_val}")
    print(f"  Predicted median house value: {pred:.4f} x $100k = ${pred*100:.2f}k")
    print("  " + "-"*76)

print("\n" + "="*80)

## Summary

In this tutorial, we:

1. ✅ Loaded and explored the California Housing dataset (20,640 samples, 8 features)
2. ✅ Visualized the data using histograms, correlation heatmaps, scatter plots, and geographic maps
3. ✅ Prepared the data by splitting and scaling
4. ✅ Trained five different regression models
5. ✅ Evaluated and compared model performance using R², MAE, RMSE, and cross-validation
6. ✅ Analyzed residuals and feature importance
7. ✅ Made predictions on new data

**Key Findings:**
- **MedInc** (median income) is the strongest predictor of house values
- Geographic features (Latitude, Longitude) are also important predictors
- Tree-based ensemble models (Random Forest, LightGBM) generally outperform linear models
- The target is capped at roughly $500k, so the top of the range is censored rather than made up of true outliers; this shows up as a spike at the right edge of the histogram
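
The cap on the target can be checked directly by counting how many values sit at the maximum. The sketch below uses a small made-up list. The cap value of 5.00001 (about $500,001) is the maximum reported by the scikit-learn copy of the dataset, but treat it as an assumption and confirm it with `df['MedHouseVal'].max()`.

```python
# Hypothetical target values in $100,000s; the last three sit at the cap
targets = [1.2, 2.5, 3.1, 0.9, 4.4, 5.00001, 5.00001, 5.00001]

cap = max(targets)  # on the real data: df['MedHouseVal'].max()
capped = [t for t in targets if t >= cap]
share = len(capped) / len(targets)

print(f"cap = {cap}, capped rows = {len(capped)} ({share:.1%})")

# One simple option: drop capped rows before training (a modeling choice)
uncapped = [t for t in targets if t < cap]
```

Dropping capped rows is only one option; whether it helps depends on whether predictions near the cap matter for the use case.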

**Understanding Regression Metrics:**
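- **R² Score**: Proportion of variance explained (at most 1, higher is better; it can be negative on held-out data when a model does worse than always predicting the mean)
  - Rough rule of thumb, which varies by domain: 0.8+ = Excellent, 0.6-0.8 = Good, 0.4-0.6 = Moderate, <0.4 = Poor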
- **MAE (Mean Absolute Error)**: Average absolute difference between predicted and actual
  - Easy to interpret: "On average, our predictions are off by $X"
- **RMSE (Root Mean Squared Error)**: Square root of average squared errors
  - Penalizes large errors more heavily than MAE
  - Same units as target variable
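
To make the three metrics concrete, here is a small worked example in plain Python with made-up values (in $100,000s). The function name `regression_metrics` is illustrative; for real use, `sklearn.metrics` provides `r2_score`, `mean_absolute_error`, and `mean_squared_error`. The second call shows that R² drops below zero when predictions are worse than always predicting the mean.

```python
from math import sqrt
from statistics import fmean

def regression_metrics(y_true, y_pred):
    """Return (R², MAE, RMSE) for two equal-length sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = fmean(abs(e) for e in errors)
    mse = fmean(e * e for e in errors)
    mean_true = fmean(y_true)
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return r2, mae, sqrt(mse)

y_true = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.5, 2.0, 2.5, 4.0]
bad_pred = [4.0, 3.0, 2.0, 1.0]  # reversed order

r2, mae, rmse = regression_metrics(y_true, good_pred)
print(f"good: R²={r2:.3f}, MAE={mae:.3f}, RMSE={rmse:.3f}")

r2_bad, mae_bad, rmse_bad = regression_metrics(y_true, bad_pred)
print(f"bad:  R²={r2_bad:.3f}, MAE={mae_bad:.3f}, RMSE={rmse_bad:.3f}")
```

In both calls RMSE is at least as large as MAE, which reflects the heavier penalty on large errors.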

**Next Steps:**
- Try feature engineering (e.g., rooms per capita, bedroom ratio)
- Experiment with hyperparameter tuning using `GridSearchCV` or `RandomizedSearchCV`
- Handle outliers and the capped target values more carefully
- Try advanced models like XGBoost or neural networks
- Apply these techniques to other regression datasets
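
As a starting point for the feature engineering idea, the ratios can be derived from the existing columns. This is a sketch on a single hypothetical row stored as a dict. The new feature names `BedroomRatio` and `RoomsPerPerson` are assumptions, and on the notebook's DataFrame the same arithmetic applies column-wise (for example `df['AveBedrms'] / df['AveRooms']`).

```python
def add_ratio_features(row):
    """Return a copy of the row with two derived ratio features."""
    out = dict(row)
    # Share of rooms that are bedrooms
    out["BedroomRatio"] = row["AveBedrms"] / row["AveRooms"]
    # Average rooms per household member
    out["RoomsPerPerson"] = row["AveRooms"] / row["AveOccup"]
    return out

# Hypothetical block group
row = {"AveRooms": 6.0, "AveBedrms": 1.2, "AveOccup": 3.0}
enriched = add_ratio_features(row)
print(enriched)
```

Whether such ratios improve a given model is an empirical question, so compare cross-validated scores with and without them.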