Loading and Preprocessing

In [1]:
from sklearn.datasets import fetch_california_housing

california = fetch_california_housing()

print(california.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

In [2]:
print(california.data.shape)
print(california.target.shape)
print(california.feature_names)

(20640, 8)
(20640,)
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']


In [3]:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor


In [4]:
california_df = pd.DataFrame(california.data,
                             columns=california.feature_names)
california_df

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [5]:
california_df.info()

for col in california_df.columns:
    # Convert any literal 'NAN' strings to np.nan so pandas treats them as missing.
    # Every column here is already float64, so this is a no-op safeguard.
    california_df.loc[california_df[col] == 'NAN', col] = np.nan

    print(f"Column : {col}")
    print(f"Column data type : {california_df[col].dtype}")
    print(f"Number of unique values : {california_df[col].nunique()}")
    print(california_df[col].unique())
    print(f"Number of missing values : {california_df[col].isna().sum()} ({california_df[col].isna().mean():.2%})")
    print("--" * 60)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
dtypes: float64(8)
memory usage: 1.3 MB
Column : MedInc
Column data type : float64
Number of unique values : 12928
[8.3252 8.3014 7.2574 ... 2.3598 2.3661 2.0943]
Number of missing values : 0 (0.00%)
------------------------------------------------------------------------------------------------------------------------
Column : HouseAge
Column data type : float64
Number of unique values : 52
[41. 21. 52. 42. 50. 40. 49. 48. 51. 43.  2. 46. 26. 20. 17. 36. 19. 23.
 38. 35. 10. 1

In [6]:
california_df['MedHouseValue'] = pd.Series(california.target)
california_df.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseValue
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [7]:
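X = california_df.drop(columns=["MedHouseValue"])  # features
y = california_df["MedHouseValue"]                 # target

# Standardize every feature to zero mean and unit variance.
# This full-data fit is for inspection only; the Linear Regression and SVR
# cells below re-fit the scaler on the training split to avoid leakage.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

# Quick check
print(X_scaled_df.head())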


     MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  2.344766  0.982143  0.628559  -0.153758   -0.974429 -0.049597  1.052548   
1  2.332238 -0.607019  0.327041  -0.263336    0.861439 -0.092512  1.043185   
2  1.782699  1.856182  1.155620  -0.049016   -0.820777 -0.025843  1.038503   
3  0.932968  1.856182  0.156966  -0.049833   -0.766028 -0.050329  1.038503   
4 -0.012881  1.856182  0.344711  -0.032906   -0.759847 -0.085616  1.038503   

   Longitude  
0  -1.327835  
1  -1.322844  
2  -1.332827  
3  -1.337818  
4  -1.337818  


Explain the preprocessing steps

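- Inspect the data type, unique values, and missing-value count of each feature.
- Missing values: locating gaps shows which columns would need to be filled with suitable values.
- Unique values: the set of values a feature takes helps in choosing a sensible fill value when gaps exist.
- Data types: checking them reveals inconsistencies, such as a numeric column stored as strings.
- Result: all 8 features are float64 with 20640 non-null entries, so no missing values had to be filled.
- Scaling: StandardScaler rescales each feature to zero mean and unit variance so that features with large ranges (e.g., Population) do not dominate scale-sensitive models.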
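The per-column checks above can also be collected into a single summary table. This is a minimal sketch, assuming a small made-up frame so that it runs without downloading the dataset; `toy_df`, `summarize_columns`, and the median fill are illustrative choices, not part of the notebook.

```python
import numpy as np
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, unique-value count, and missing-value stats per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),          # NaN is not counted as a value
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })


# Toy data for illustration only (not the California Housing values).
toy_df = pd.DataFrame({
    "MedInc": [8.3, 7.2, np.nan, 3.8],
    "HouseAge": [41.0, 52.0, 52.0, 52.0],
})

summary = summarize_columns(toy_df)
print(summary)

# If a column did contain gaps, median imputation is one common option.
filled_df = toy_df.fillna(toy_df.median())
print(filled_df)
```

On the real `california_df` the same helper would report zero missing values for every column, matching the `info()` output.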

Linear Regression

In [8]:
#  Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features; fit the scaler on the training split only to avoid leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#  Train Linear Regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

#  Make predictions
y_pred = model.predict(X_test_scaled)

# Evaluate performance
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Performance of Linear Regression on California Housing dataset:")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R2 Score: {r2:.4f}")


Performance of Linear Regression on California Housing dataset:
Mean Squared Error (MSE): 0.5559
Mean Absolute Error (MAE): 0.5332
R2 Score: 0.5758


- Linear Regression fits a hyperplane (a straight line when there is one feature) that minimizes the squared error between predicted and actual house values.
- Why suitable here: house values tend to rise roughly linearly with features such as median income, which makes Linear Regression a sensible baseline. Its R2 of 0.5758 shows that a purely linear fit explains only part of the variance.
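To make "minimizes the squared error" concrete, the sketch below solves the same least-squares problem directly with NumPy and compares the result with `LinearRegression`. The data is synthetic and the names `X_demo`, `y_demo`, and `lin_demo` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for two scaled features (illustration only).
X_demo = rng.normal(size=(200, 2))
y_demo = (0.8 * X_demo[:, 0] - 0.3 * X_demo[:, 1] + 2.0
          + rng.normal(scale=0.1, size=200))

# Ordinary least squares: append a column of ones for the intercept
# and solve min_w ||A w - y||^2.
A = np.column_stack([X_demo, np.ones(len(X_demo))])
w, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

lin_demo = LinearRegression().fit(X_demo, y_demo)

print("lstsq   coefficients:", w[:2], "intercept:", w[2])
print("sklearn coefficients:", lin_demo.coef_, "intercept:", lin_demo.intercept_)
```

Both approaches should agree up to floating-point error, since they minimize the same objective.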


Decision Tree Regressor

In [None]:
#  Initialize Decision Tree Regressor
tree_model = DecisionTreeRegressor(
    random_state=42,
    max_depth=10,        # limit depth to avoid overfitting
    min_samples_split=50 # minimum samples to split a node
)

# Train the model
tree_model.fit(X_train, y_train)

# Make predictions
y_pred = tree_model.predict(X_test)

# Evaluate performance

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Performance of Decision Tree Regression on California Housing dataset:")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R2 Score: {r2:.4f}")



- Decision Tree Regressor splits the dataset into regions using feature thresholds (e.g., “Is median income > 4?”). Each leaf node predicts a constant value: with the default squared-error criterion, the mean target of the training samples in that leaf.
- Why suitable for California Housing:
  - Captures non-linear relationships (e.g., sharp jumps in house value when income crosses a threshold).
  - Handles feature interactions naturally (e.g., income combined with population).
  - Easy to interpret, though it can overfit if the tree grows too deep, which is why max_depth and min_samples_split are set to limit complexity.
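The threshold-and-leaf idea can be seen on a tiny example. The sketch below fits a depth-1 tree to made-up data in which the target jumps once the feature exceeds 4; the values and the name `stump` are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up step function: the target jumps from 1.0 to 3.0 above income 4.
income = np.arange(1, 9, dtype=float).reshape(-1, 1)   # 1.0, 2.0, ..., 8.0
value = np.where(income.ravel() > 4, 3.0, 1.0)

# A depth-1 tree (a "stump") can place exactly one threshold.
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(income, value)

# Print the learned rule and the constant predicted in each leaf.
print(export_text(stump, feature_names=["MedInc"]))
print(stump.predict([[2.0], [6.0]]))
```

The learned threshold falls midway between the two neighbouring feature values on either side of the jump, and each leaf predicts the mean target of its region.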


Random Forest Regressor

In [None]:
#  Initialize Random Forest Regressor
rf_model = RandomForestRegressor(
    n_estimators=100,   # number of trees
    random_state=42,
    max_depth=None,     # let trees expand fully
    n_jobs=-1           # use all CPU cores for speed
)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

# Evaluate performance
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Performance of Random Forest Regression on California Housing dataset:")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R2 Score: {r2:.4f}")

# Feature importance (optional, for interpretability)
feature_importances = pd.Series(rf_model.feature_importances_, index=X.columns)
print("\nFeature Importances:")
print(feature_importances.sort_values(ascending=False))


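- How it works: Builds many decision trees, each trained on a bootstrap sample of the training data, and averages their predictions.
- Suitability: Averaging reduces the variance (overfitting) of individual deep trees, is robust to noise, and still captures complex feature interactions. A strong default for tabular datasets like housing.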
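The averaging step can be checked directly, because a fitted forest exposes its individual trees through `estimators_`. This is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and `forest` are names assumed for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic non-linear data (illustration only).
X_demo = rng.uniform(-2, 2, size=(300, 3))
y_demo = (np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2
          + rng.normal(scale=0.1, size=300))

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Collect each tree's prediction for the first five rows, then average.
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print("mean of trees :", manual_average)
print("forest.predict:", forest.predict(X_demo[:5]))
print("spread across trees (std):", per_tree.std(axis=0))
```

The spread across trees gives a rough sense of how much the individual trees disagree before averaging smooths them out.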


Gradient Boosting Regressor

In [None]:
#  Initialize Gradient Boosting Regressor
gb_model = GradientBoostingRegressor(
    n_estimators=200,   # number of boosting stages (trees)
    learning_rate=0.1,  # step size shrinkage
    max_depth=3,        # depth of individual trees
    random_state=42
)

#  Train the model
gb_model.fit(X_train, y_train)

#  Make predictions
y_pred = gb_model.predict(X_test)

#  Evaluate performance
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Performance of Gradient Boosting Regression on California Housing dataset:")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R2 Score: {r2:.4f}")

#  Feature importance (optional, for interpretability)
feature_importances = pd.Series(gb_model.feature_importances_, index=X.columns)
print("\nFeature Importances:")
print(feature_importances.sort_values(ascending=False))


- Gradient Boosting Regressor builds trees sequentially, where each new tree is fit to the errors left by the ensemble built so far.
- Each stage takes a gradient-descent step on the loss function, scaled by learning_rate, so many shallow trees add up to an accurate model for complex regression tasks.
- Why suitable for California Housing:
  - Captures non-linear relationships between features like income, house age, and population.
  - Can match or exceed Random Forest accuracy when tuned properly.
  - Provides feature importance scores, which help with interpretation and visualization.
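The sequential error-correcting idea can be written out by hand for squared-error loss, where the negative gradient is simply the residual. The loop below is a simplified sketch on synthetic data, not the internals of `GradientBoostingRegressor`; the names `weak_tree`, `n_stages`, and `mse_history` are assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic one-feature regression problem (illustration only).
X_demo = rng.uniform(-3, 3, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=400)

learning_rate = 0.1
n_stages = 50

# Stage 0: a constant prediction (the mean minimizes squared error).
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_stages):
    # For squared error, the negative gradient equals the residual.
    residual = y_demo - prediction
    weak_tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    weak_tree.fit(X_demo, residual)
    # Move the ensemble a small step toward the residual fit.
    prediction += learning_rate * weak_tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"training MSE: start={mse_history[0]:.4f}, end={mse_history[-1]:.4f}")
```

A smaller `learning_rate` makes each step more cautious, so more stages are needed to reach the same training error; this is the trade-off behind the `n_estimators` and `learning_rate` settings used above.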


Support Vector Regressor (SVR)

In [12]:
# Scale features (SVR is sensitive to feature scales)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#  Initialize Support Vector Regressor
svr_model = SVR(kernel='rbf', C=100, epsilon=0.1)

#  Train the model
svr_model.fit(X_train_scaled, y_train)

#  Make predictions
y_pred = svr_model.predict(X_test_scaled)

# Evaluate performance
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Performance of SVR Regression on California Housing dataset:")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R2 Score: {r2:.4f}")


Performance of SVR Regression on California Housing dataset:
Mean Squared Error (MSE): 0.3201
Mean Absolute Error (MAE): 0.3717
R2 Score: 0.7557


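- Support Vector Regressor (SVR) fits a function while ignoring errors smaller than a tolerance epsilon (0.1 here); only points outside this margin contribute to the loss.
- It uses kernels (RBF here) to capture non-linear relationships between features and house prices.
- Because the RBF kernel depends on distances between samples, the features must be scaled first.
- Why suitable for California Housing:
  - Can model smooth non-linear effects (e.g., how income and population density interact).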
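The margin of tolerance corresponds to the epsilon-insensitive loss: errors inside the tube cost nothing, and errors outside it are penalized linearly by the excess. A minimal NumPy sketch follows; the function name and the sample numbers are made up for illustration.

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube, linear in the excess error outside it."""
    error = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, error - epsilon)


# Made-up targets and predictions (units of $100,000, as in the dataset).
y_true = np.array([2.00, 2.00, 2.00])
y_pred = np.array([2.05, 2.50, 1.70])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```

The first prediction is off by 0.05, which lies inside the tube and is not penalized; the other two are penalized only by the amount that exceeds epsilon.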


In [14]:
# Q3 task 2 - Compare all five models on the same train/test split and explain the results

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42, max_depth=10),
    "Random Forest": RandomForestRegressor(random_state=42, n_estimators=100),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42, n_estimators=200),
    "SVR": SVR(kernel='rbf', C=100, epsilon=0.1)
}

# Train and evaluate
results = {}
for name, model in models.items():
    if name in ["Linear Regression", "SVR"]:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

    results[name] = {
        "MSE": mean_squared_error(y_test, y_pred),
        "MAE": mean_absolute_error(y_test, y_pred),
        "R2": r2_score(y_test, y_pred)
    }

# Display results
results_df = pd.DataFrame(results).T
print(results_df)


                        MSE       MAE        R2
Linear Regression  0.555892  0.533200  0.575788
Decision Tree      0.415468  0.433203  0.682948
Random Forest      0.255368  0.327543  0.805123
Gradient Boosting  0.261498  0.348343  0.800445
SVR                0.320104  0.371703  0.755722


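<u>- Best Performing → Random Forest Regressor</u>
- Lowest MSE (0.2554) and MAE (0.3275), highest R² (0.8051).
- Averaging 100 fully grown trees reduces variance while still capturing non-linear patterns in the housing data.
- Handles feature interactions (income, rooms, population) effectively.
- Gradient Boosting is a close second (R² 0.8004) and may close the gap with further tuning.

<u>- Worst Performing → Linear Regression</u>
- Highest MSE (0.5559) and MAE (0.5332), lowest R² (0.5758).
- Cannot represent the non-linear relationships in the housing data.
- Remains the easiest model to interpret, at the cost of accuracy.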
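The ranking can be read off the reported metrics programmatically. The sketch below re-enters the values from the results table by hand; `reported` and `ranked` are names chosen for this example. The RMSE column is added because the target is expressed in units of $100,000, so RMSE is directly comparable to house values.

```python
import pandas as pd

# Metrics copied from the results table printed above.
reported = pd.DataFrame(
    {
        "MSE": [0.555892, 0.415468, 0.255368, 0.261498, 0.320104],
        "MAE": [0.533200, 0.433203, 0.327543, 0.348343, 0.371703],
        "R2": [0.575788, 0.682948, 0.805123, 0.800445, 0.755722],
    },
    index=["Linear Regression", "Decision Tree", "Random Forest",
           "Gradient Boosting", "SVR"],
)

# Rank models from best to worst by R2 and add RMSE in target units.
ranked = reported.sort_values("R2", ascending=False)
ranked["RMSE"] = ranked["MSE"] ** 0.5

print(ranked)
print("Lowest MSE :", reported["MSE"].idxmin())
print("Lowest MAE :", reported["MAE"].idxmin())
print("Highest R2 :", reported["R2"].idxmax())
```

These numbers come from a single train/test split, so small gaps between models (such as the top two) should be treated with caution.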
